Preface


Synthetic Biology is an emerging field that integrates all biological subsystems with 
the objective of engineering, constructing or modifying biological functions, organelles, 
cellular structures, simple cells and creating whole novel organisms with designed 
properties. This is accomplished by application of engineering principles such as 
hierarchical design, modular parts, isolation of unrelated functions and standard 
interfaces as well as systems biology and all aspects of molecular biology. Synthetic 
Biology extends genetic engineering to focus on whole systems rather than individual 
genes and primary gene products, and will ultimately provide diagnostic tools, novel 
methods for production of therapeutics and strategies for treatment of diseases. Our 
compendium is written for university undergraduates, graduate students, faculty and 
investigators at research institutes. Our Board of 11 Nobel Laureates approved our 
overall approach and content and our selection of articles was validated and enhanced 
by four reviewers from major research institutions. The compendium comprises 23 peer-reviewed
articles totaling over 700 pages and, as such, is the largest in-depth, up-to-date treatment
presently available.

The 23 detailed articles are organized into six sections: Biological Basis; Modeling;
Modular Parts and Circuits; Synthetic Genomes; Diseases and Therapeutics; and 
Chemicals Production. In addition, there is an introductory article entitled Synthetic 
Biology. The Biological Basis section defines key areas that support synthetic 
biology approaches including Emergence of the First Cells (Protocells), Regulation 
of Gene Expression, the Interactome and also Microbiomes; the Modeling section 
provides mathematical and computer programming expertise including Dynamics of 
Biomolecular Networks, Computer Simulation of the Cell and the SynBioSS Designer 
Modeling Suite, which support the following section on Modular Parts and Circuits. This 
section covers key practical applications and advances such as Synthetic Gene Networks, 
DNA Origami Nanobots, RNAi Synthetic Logic Circuits for Sensing, Information 
Processing and Actuation, Synthetic Hybrid Biosensors and Synthetic Biology in 
Metabolic Engineering. The Synthetic Genomes section includes articles on Minimal 
Gene-Set Machinery; Production of the Mitochondrial Genome and Chromosomal 
DNA Segments, and Synthetic Genetic Polymers Functioning to Store and Propagate 
Information. As part of the introductory article, Sanjay Vashee, of the J. Craig Venter 
Institute, covers the recent work on the formation of a bacterial cell that is controlled by a 
chemically synthesized genome. This work was accomplished at the Institute where the
author is located. The next two sections cover important end-use applications of synthetic
biology. Diseases and Therapeutics covers synthetic biological improvements in the use
of stem cells in regenerative medicine and new approaches for vaccine development
utilizing synthetic biology. The final section, Chemicals Production, includes
synthetic biology approaches for production of diols, biofuels and antibiotics.

We are pleased that many of the major synthetic biology research institutions partic- 
ipated in preparation of our book, e.g. the Departments of Biological Engineering, the 
Synthetic Biology Center of the Department of Electrical Engineering & Computer Sci- 
ence, the MIT Microbiology Program, and the MIT Computational and Systems Biology 
Program all of the Massachusetts Institute of Technology; the ETH Zurich, the National 
Institutes of Health, and the J. Craig Venter Institute. 

Our team hopes that you, the reader, will benefit from our hard work — finding the 
content useful in your research and education. We wish to thank our Managing Editor, 
Sarah Mellor as well as our Executive Editor, Gregor Cicchetti for both their advice and 
hard work in the course of this project. 


Larkspur, California, January 2015 Robert A. Meyers 
RAMTECH Limited 
Keywords


Synthetic biology 

An effort to construct biological systems, which may include entire biosynthetic 
pathways, synthetic organelles and cellular structures, and whole organisms, that 
have medical, industrial, and scientific applications. This is achieved via the application 
of engineering principles, such as hierarchical design, modular reusable parts, the 
isolation of unrelated functions, and standard interfaces. 


Synthetic cell 
A cell that is controlled solely by a genome that was assembled from chemically 
synthesized pieces of DNA. 


DNA assembly 
The building of larger DNA fragments from smaller DNA fragments. 


Circuits 
A collection of various modular component parts that responds to an input signal that 
is then relayed to produce an output signal. 


Compartmentalization 
The spatial sequestering of substrates, intermediates, products, enzymes, and activities. 


Synthetic biology is an effort to construct and engineer biological systems, ranging 
from individual genetic elements, to biosynthetic pathways, to whole organisms. 
The results of these engineering efforts can be of great value to human interests such 
as medicine and industry. In this chapter, advances in DNA assembly technologies 
are reviewed, and how these advanced DNA assembly technologies, in conjunction 
with the application of engineering principles such as modular parts, have facilitated 
the rational engineering of organisms to obtain desired functions or to understand 
complex cellular behavior, are highlighted. The recent creation of a synthetic cell 
is also described. Finally, the societal concerns posed by synthetic biology are 
discussed. 


1
Introduction 


The field of Synthetic Biology can be con- 
sidered more as an engineering discipline, 
and less as an empirical science. Efforts 
to create artificial life systems, both in
biochemical systems [1] and in software
environments [2], may also be consid- 
ered as Synthetic Biology, though these 
are beyond the scope of this chapter. 
Synthetic Biology is viewed as the effort 
to construct and engineer biological systems
of value to human interests. Such
efforts can range in scope far larger
than the traditional genetic engineering of 
genes, to include the engineering of entire 
biosynthetic pathways complete with the 
regulation of the genes in that pathway 
[3, 4], synthetic organelles and cellular 
structures [5], whole organisms [6-9], and 
even ecosystems [10-13]. Synthetic Biol- 
ogy has the ambition to apply classical 
engineering principles such as hierarchi- 
cal design, modular reusable parts, the 
isolation of unrelated functions, and stan- 
dard interfaces. The empirical fields that 
correlate to Synthetic Biology are Systems 
Biology, Genetics, and Molecular Biology. 

Synthetic Biology is not a new field, but 
rather extends back into prehistory. For 
example, it has been determined that the 
process of engineering maize — a highly 
optimized domestic agricultural crop plant 
— from the wild grass teosinte began 
over 9000 years ago [14]. The method 
used by the pre-Columbian cultivators 
of teosinte was simple artificial selection 
which, as such, is very slow. However, 
with the discovery of laws of inheritance 
and natural selection [15-17], and the 
suggestion that DNA was the chemical 
medium of inheritance [18], the scene 
was set to engineer a living system in 
a far more direct and rapid manner. 
A prominent example of this is the
DuPont Escherichia coli strain used for the
production of 1,3-propanediol, in which
case an entire biosynthetic pathway has
been added to E. coli, and the metabolism
of the bacterium substantially altered
to allow for a majority of the carbon
feedstock (glycerol) to be converted into
the economically valuable chemical
1,3-propanediol [6, 8, 9]. This feat, which
was begun prior to the development of 
most of the Synthetic Biology techniques 
reviewed in this chapter, took many years 
and substantial investment to achieve. Yet,
with recent advances in the field, such
bioengineering projects will become faster 
to develop, easier to operate, and also much 
more ambitious. 

During recent years, Synthetic Biology 
has progressed in a manner which is very 
different from those of other engineering 
disciplines. This is because, unlike archi- 
tecture or software engineering, there is 
already a reservoir of highly sophisticated 
and complex functional parts to be found 
in Nature, and consequently most efforts 
in Synthetic Biology have been focused on 
harnessing that natural resource base. In 
general, two basic approaches have been 
undertaken to achieve this feat. The first 
approach has been to engineer natural 
organisms so as to incorporate recombi- 
nant pathways and other such desirable 
attributes. This method has the advantage
of requiring neither the capability to build,
nor an understanding of, massive
biological systems such as genomes
and metabolisms. However, it does have
the disadvantage of being undefined; that 
is, whilst certain genes of the organism 
might be the result of human intervention, 
most of the genome remains wild-type, 
and is neither subject to human control 
nor necessarily operating within the limits 
of human knowledge. 

The alternate approach is to use func- 
tional components, originally ‘mined’ 
from Nature, such as promoters, ap- 
tamers, protein-protein interaction do- 
mains, terminators, or ribosome-binding 
sites. These functional components can be 
cataloged and then used to compose larger 
defined constructions of genes, pathways, 
and even whole genomes. Because all 
the components of a synthetic biological 
system that are constructed in such an 
approach have precisely defined proper- 
ties, a high degree of predictive control 
over the final product is afforded. Indeed, 
this more defined approach has become
synonymous with advances in Synthetic 
Biology. 

Historically, one of the main limitations 
in following this defined approach relates 
to the knowledge of these natural systems 
that serve as a source of parts. As biology 
has been characterized, both new com- 
ponents for synthetic biology — and also 
new tools to utilize those components — 
have become available. In turn, a new 
Synthetic Biology capability has driven 
greater advances in the understanding of 
biology. This has been most obvious in 
the development of tools for the synthe- 
sis and manipulation of DNA, the first of 
which were developed via the discovery 
of restriction endonucleases, DNA ligases, 
and the creation of recombinant DNA 
molecules [19-24]. These tools allowed 
the development of the recombinant DNA 
cloning and expression techniques that ul- 
timately made possible the exploitation of 
enzymes with desired activities and prop- 
erties. Notably, the development of the 
polymerase chain reaction (PCR) greatly 
increased the ability to amplify and manipulate
DNA [25-27]. The subsequent
combination of recombinant DNA cloning
techniques with the PCR allowed a much 
greater exploitation of natural biological 
components, such as thermostable DNA 
polymerases, and this in turn made the 
PCR more robust and practical. Today, the 
PCR has become an indispensable tool for 
biology. Thus, knowledge of natural sys- 
tems has led to an improved technology 
for exploiting those systems, which has 
in turn provided an improved knowledge 
of biological systems in a double feedback
loop, thus improving both scientific
understanding and technological capabilities.
As shown in Fig. 1, as knowledge of
the fundamental principles of biology has
continued to grow, it has been possible to
take a more defined engineering approach.

In the past, advances in synthetic biol- 
ogy have been bounded by the capacity 
to assemble and modify DNA, as well 
as knowledge of biological parts and cir- 
cuits that such DNA might encode. Cor- 
respondingly, these two areas are directly 
addressed in the following sections, with 
details of synthetic biological pathways, 
synthetic genomes, synthetic organelles, 
and even synthetic organisms provided as
examples of the capabilities that Synthetic
Biology has already begun to deliver.
These new capabilities create, in turn,
new societal challenges, and these are also
discussed.

Fig. 1 Increasingly defined biological engineering. A
schematic of how biological engineering has emphasized a
more defined and rational design as it has advanced, based
on a greater knowledge of natural biological systems.
[Figure heading: Progress in Synthetic Biology is defined by
the shifting of life-manipulation from the undefined to
defined products and techniques. The schematic spans from
wholly undefined custom biological systems (slow, limited in
scope, but requires little or no biological knowledge) to
wholly defined custom biological systems (fast, powerful, but
requires substantial biological knowledge). Labeled
approaches: Natural Variation + Artificial Selection; Cross
Hybridization + Artificial Selection; Targeted DNA
Manipulation; Distributed Genome Manipulation.]


2 
DNA Assembly and Modification 


As discussed above, one way to view 
Synthetic Biology is as an engineering 
discipline aimed at manipulating cellu- 
lar systems to produce a de novo-designed 
function that does not exist in the nat- 
ural organism. As with other engineer- 
ing fields, however, Synthetic Biology is 
dependent on the tools and techniques 
available. 

Organisms carry out a variety of reac- 
tions aimed at their self-sustenance and 
self-replication, with such reactions being 
carried out by the proteins and RNAs en- 
coded in the organism’s genome. As the 
sequence of the DNA directs production 
of the proteins and RNAs of an organism, 
control of the cellular DNA therefore al- 
lows for an ability to direct the functions 
of a cell. Based on this principle, many 
of the basic tools utilized in Synthetic 
Biology are aimed at producing defined 
sequences of DNA molecules, and eas- 
ily manipulating the DNA content of an 
organism. The DNA molecules necessary 
for Synthetic Biology purposes can vary 
greatly in length, from individual DNA 
parts, genes and plasmids (containing tens 
to thousands of base pairs) to biosynthetic 
pathways and genetic circuits (thousands 
to millions of base pairs) to synthesizing 
whole genomes (viral and bacterial). 

For many years, gene cloning and DNA 
assembly were dominated by the use 
of restriction endonucleases and DNA 
ligases [19-24]. While some well-designed 
restriction enzyme-based methods are still
commonly in use [28, 29], these methods 
are gradually being superseded by the de- 
velopment of very rapid, more robust and 
less limited DNA assembly techniques. 
For example, although BioBricks were 
originally designed to be assembled with 
a restriction enzyme/ligation method [30], 
more recently an in vitro homologous 
recombination was adapted to increase 
the flexibility and speed of BioBrick 
assembly [31]. 

The starting materials for DNA se- 
quence construction may include chem- 
ically synthesized DNA oligonucleotides 
(oligos), natural DNA fragments, PCR 
products, or a combination of all three 
sources. Defined short single-stranded oli- 
gos have been commercially available as a 
commodity for many years, usually for use 
as primers in PCR, mutagenesis, and se- 
quencing reactions. Since chemically syn- 
thesized DNA oligos are of a user-defined 
sequence, this allows for a nucleotide level 
control of gene sequences and even entire 
genome sequences — a firm requirement 
when designing new functions in organ- 
isms. The idea of synthesizing genes from 
DNA oligos is not new; previously, oligos 
have been used to synthesize genes such 
as the alanine tRNA from yeast [32] and 
the human leukocyte interferon gene [33]. 
Likewise, the gene encoding a mammalian 
hormone, somatostatin, was synthesized 
and expressed in E. coli [34]. It is only 
recently that the lower costs of oligonu- 
cleotide synthesis and DNA sequencing 
have been combined to allow the devel- 
opment of more cost-effective and rapid 
methods for assembling groups of oligos 
into synthetic pieces of DNA or genes 
[35-39]. Today, several commercial gene 
synthesis companies exist that are able to 
produce custom genes/DNA at an accessi- 
ble cost per base pair, although such costs 
can quickly become prohibitive if numerous
different DNAs are required. The prohibitive
cost of large-scale gene synthesis
can, in part, be overcome by utilizing nat- 
ural DNA fragments and PCR products 
in the DNA assembly for those sections 
of DNA that do not need to be created 
synthetically. 

Typically, the process of constructing 
DNA is hierarchical (Figs 2 and 3). 
Briefly, groups of smaller DNAs (single- or 
double-stranded) are mixed and assembled 
into larger DNA pieces. Figure 2 shows 
double-stranded DNAs being assembled 
into a larger construct, but the process can 
begin with single-stranded oligos as the 
substrates. These larger pieces (subassem- 
blies) are then grouped and assembled. 
These steps can be repeated until the fi- 
nal full-length DNA construct is obtained, 
whether it is a gene or genome. The DNA 
pieces to be assembled must have homol- 
ogous overlapping ends, the overlaps be- 
ing important because the DNA assembly 
techniques utilize homologous recombi- 
nation. For example, if three pieces of DNA 
(A, B, and C) are to be assembled into a sin- 
gle DNA molecule, then one end of piece A 
must have an overlap with piece B, and the 
other end of B must overlap piece C. This 
configuration will result in a linear DNA 
molecule, A-B-C. In order to generate a 
circle from these pieces, the end of C must 
overlap piece A. The assembly of DNA 
into a circle is most often achieved with 
a DNA piece which contains sequences 
that enable the final construct to be cloned 
into a desired host (e.g., E. coli, Saccha- 
romyces cerevisiae or Bacillus subtilis) [37, 38, 
40-42]. The DNA homologous recombina- 
tion reaction can be carried out completely 
in vitro in one step, either with an enzyme 
mix [37] or by using the PCR [35]. The 
reaction can also be performed in vivo by
utilizing the natural homologous recombination
activity of an organism, such as S.
cerevisiae (yeast) and B. subtilis [7, 40-44].
Occasionally, a method will include an 
in vitro step to perform a partial reaction 
(DNA chewback/DNA annealing/DNA ex- 
tension), and an in vivo step to complete 
the reaction (DNA repair) [38, 45, 46]. 
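The overlap requirement described above (A overlaps B, B overlaps C, and C overlaps A for a circle) can be illustrated with a minimal sketch. It is a toy model only: fragments are treated as plain strings for one strand, overlaps are assumed to be exact, and the function names and the six-letter example overlaps are hypothetical rather than taken from any published protocol (real overlaps are far longer).

```python
def join_by_overlap(left, right, min_overlap=20):
    """Join two fragments whose adjacent ends share an identical overlap.

    Finds the longest suffix of `left` that equals a prefix of `right`
    and merges them so that the shared region appears only once.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    raise ValueError("fragments do not share a sufficient overlap")


def assemble(fragments, min_overlap=20, circular=False):
    """Assemble fragments in the given order (A, B, C, ...)."""
    product = fragments[0]
    for fragment in fragments[1:]:
        product = join_by_overlap(product, fragment, min_overlap)
    if circular:
        # To close the circle, the end of the last piece must overlap
        # the start of the first piece.
        for size in range(len(product) // 2, min_overlap - 1, -1):
            if product[-size:] == product[:size]:
                return product[:-size]  # keep the closing overlap only once
        raise ValueError("ends do not overlap; cannot circularize")
    return product


# Hypothetical toy fragments with 6-letter overlaps.
piece_a = "CTGCAG" + "ATATAT" + "GGATCC"
piece_b = "GGATCC" + "CGCGCG" + "AAGCTT"
piece_c = "AAGCTT" + "TTTTAA" + "CTGCAG"

linear = assemble([piece_a, piece_b, piece_c], min_overlap=4)
circle = assemble([piece_a, piece_b, piece_c], min_overlap=4, circular=True)
print(len(linear), len(circle))  # 42 36
```

The circular product is represented here as a linear string whose last base is understood to join back to its first; the sketch ignores strand chemistry, which the enzymatic steps described in the text take care of.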

The various in vitro homologous re- 
combination methods used to assemble 
double-stranded DNA or single-stranded 
oligos share the same general mechanism 
(Fig. 2). The nucleotides are first removed 
from one strand of the overlapping ends of 
the adjacent double-stranded DNAs, thus 
creating single-stranded ends of the DNA 
(Step 1). This process is analogous to a 
restriction enzyme digestion creating com- 
plementary sticky ends of DNA, except 
that the single-stranded ends are typically
20-60 nucleotides long. Depending on the 
method used, the nucleotides can be re- 
moved by applying an exonuclease activity 
from either the 5’ or 3’ ends of the DNA 
[37, 38]. The creation of single-stranded 
overhangs (Step 1) is not necessary if oli- 
gos are used as the starting material for 
DNA assembly, because the oligos are al- 
ready single-stranded. The single-stranded 
DNA overhangs are complementary to 
each other on adjacent molecules, and thus 
are able to anneal (Step 2). If the ends of 
the molecule being constructed are com- 
plementary, then the final construct will 
be circular (this is the normal method 
used when DNA is being assembled into 
a cloning or expression vector). The final 
step to the reaction is repair of the DNA. 
In the in vitro reaction, a DNA polymerase 
is used to fill the gaps, while a DNA ligase 
seals the nicks so as to create the larger as- 
sembled DNA molecule. The DNA repair 
activity of E. coli can be utilized to com- 
plete the reaction after the annealing step 
[38] and after other enzymatic steps that
remove errors [36]. The final assembly can
be amplified by using the PCR and, if necessary,
used in another round of assembly
to generate even larger DNA molecules.
If the final assembly is circular, it can
be transformed into the appropriate host
in order to generate individual assembly
clones.

Fig. 2 Schematic depicting in vitro homologous
recombination DNA assembly. First,
nucleotides are removed from either the 5’ or
the 3’ ends of the DNA pieces (5’ removal depicted).
This step can be performed by several
enzymes. The newly exposed single-stranded
homologous ends (red, green, or yellow regions)
on the adjacent pieces are complementary,
and can anneal. Providing homology
at the ends of the DNA pieces will result in
the assembly of a circular DNA molecule.
Following the annealing step, the DNA is repaired
by filling in the gaps with a DNA polymerase
and sealing any nicks with DNA ligase.
A linear assembly product can be amplified by
using the PCR and used in further assembly
reactions. The circular assembly products are
transformed into the appropriate host in order
to isolate individual clones with the final
assembled molecule. The assemblies can also
be repaired in vivo after transformation by the
native activities of E. coli.
[Figure labels: T5 exonuclease (5’); Exo III exonuclease (3’);
Phusion DNA polymerase; In vitro repair (circular assemblies only);
Linear assembly product; Amplify assemblies by PCR; Circular
assembly product; Transformation; Individual assembly clones.]
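The chew-back and annealing steps (Steps 1 and 2) can be sketched in a few lines. This is an illustrative model under stated assumptions, not an implementation of any specific kit or protocol: a duplex is represented only by its top strand written 5’ to 3’, a 5’ exonuclease is assumed to remove the same number of nucleotides from both 5’ ends, and all function names and sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def overhangs_after_5prime_chewback(top_strand, chew):
    """Return the two 3' overhangs exposed by a 5' exonuclease.

    Each overhang is written 5'->3'. At the left end the top strand is
    removed, exposing the bottom strand; at the right end the bottom
    strand is removed, exposing the top strand.
    """
    if 2 * chew >= len(top_strand):
        raise ValueError("chew-back too long; the duplex would fall apart")
    left_overhang = reverse_complement(top_strand[:chew])
    right_overhang = top_strand[-chew:]
    return left_overhang, right_overhang


def can_anneal(overhang_a, overhang_b):
    """Two single strands anneal if one is the reverse complement of the other."""
    return overhang_a == reverse_complement(overhang_b)


# Hypothetical adjacent fragments sharing an 8-nt homologous end.
homology = "TTACGGAT"
left_fragment = "ATGCCGTAAGGC" + homology
right_fragment = homology + "CCGGTTAACCGG"

_, exposed_on_left_fragment = overhangs_after_5prime_chewback(left_fragment, 8)
exposed_on_right_fragment, _ = overhangs_after_5prime_chewback(right_fragment, 8)
print(can_anneal(exposed_on_left_fragment, exposed_on_right_fragment))  # True
```

The example uses an 8-nt overlap only to keep the strings readable; the text notes that the single-stranded ends are typically 20-60 nucleotides long. Gap filling and nick sealing (the repair step) are not modeled.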

The construction of DNA from oligos 
can also be performed in vivo using the re- 
combination activity of an organism. The 
yeast S. cerevisiae has robust homologous 
recombination activity, as well as an abil- 
ity to take up multiple pieces of double- or 
single-stranded DNA [47-51]. Previously,
Gibson and colleagues have shown that 
yeast can assemble at least 38 overlapping 
single-stranded 60-mer oligos directly into 
a plasmid, thus forming a 1170 bp DNA 
insert. Alternatively, fewer — but longer — 
oligonucleotides (up to 200 nt in length) 
that overlap by as little as 20 bp can also 
be used to assemble 1100 bp assemblies 
directly into the desired plasmid [43]. The 
ability of yeast to support at least 2 Mb of 
cloned DNA also makes yeast a good host 
for assembling large constructs. 

At this point, a discussion of gene
synthesis and DNA assembly methods, 
within the context of the synthesis of three 
different genomes, will be used to high- 
light and demonstrate a variety of DNA 
assembly methods that have been used to 
synthesize DNA sequences, starting from 
single-stranded oligos, and the hierarchi- 
cal assembly of the DNA subfragments 
into complete genomes. This is not meant 
to be a comprehensive list of all avail- 
able methods and techniques; rather, the 
intention is to demonstrate the flexibility 
of recently used methods in DNA assem- 
bly. Whilst the discussion will be within 
the framework of whole genome synthesis,
these techniques can also be used to
synthesize and/or assemble any DNA of
interest, from a few base pairs in length to
over a million.

Gibson and colleagues have synthesized 
three genomes using both in vitro and 
in vivo assembly techniques [7, 40, 41].
Figure 3 shows a schematic flow of
the hierarchical synthesis of the mouse
mitochondrial genome (16 299 bp) (this
has also been assembled, using a different
method, by Itaya et al.; see below),
the Mycoplasma genitalium genome
(582 970 bp), and the Mycoplasma mycoides
ssp. capri genome (1 077 947 bp). The
basis of these DNA assembly methods is
the DNA homologous recombination of
overlapping DNA fragments.

Fig. 3 Synthesis of mitochondrial and bacterial
genomes. The hierarchical assembly of
three genomes is depicted with the sizes of
the intermediate subassemblies and final products
on a logarithmic scale. The red arrows
represent in vitro assembly, and the green
arrows in vivo assembly in yeast. Values in
parentheses indicate the number of pieces
at that stage in the assembly. The colored
bars on the left represent the several different
DNA molecule classes that can be produced,
and their relative sizes. (a) The mouse mitochondrial
genome was synthesized starting
from 60 nucleotide-long oligonucleotides in
four stages. All of the assembly steps were
performed in vitro; (b) The Mycoplasma genitalium
genome was assembled from 5 to
7 kb cassettes purchased from a custom DNA
synthesis company. Both, in vitro and in vivo
DNA assembly techniques were utilized in the
genome construction. The final stage of the
assembly was performed in vivo, using yeast.
The ability of yeast to take up multiple DNA
pieces can eliminate several rounds of construction,
as demonstrated by the 25-piece
in vivo assembly of the genome; (c) The Mycoplasma
mycoides ssp. capri genome was
assembled from over one thousand 1080 bp
purchased DNA cassettes. The three assembly
steps for this genome were all performed
in vivo, with yeast as the host. The dotted
red and green lines represent the availability
of several in vitro and in vivo DNA synthesis
methods that can be applied to produce the
DNA cassettes for use in hierarchal DNA assemblies.

The mouse mitochondrial genome was 
synthesized starting from 600 overlap- 
ping 60nt-long single-stranded DNAs 
(60-mers), using an in vitro one-step as- 
sembly method (as discussed above) and 
PCR [52]. The first stage of genome con- 
struction produced 284 bp subassemblies 
by assembling groups of eight oligos di- 
rectly into pUC19 (Fig. 3a). Assembling 
the oligos directly into a cloning vec- 
tor allowed the individual assemblies to 
be cloned, isolated, and sequence-verified 
before continuing with the construction 
process. Following sequence verification, 
the correct 284bp first stage assemblies 
were amplified by PCR in order to generate 
more material. Following amplification, 
the first-stage assemblies were pooled into 
overlapping groups of five, and again as- 
sembled in vitro to produce the 1.2kb 
second stage assembly intermediates. The 
second stage assemblies were not propa- 
gated in a host organism, but rather were 
PCR-amplified immediately following the 
in vitro assembly reaction. These 15 PCR 
products were pooled into groups of five 
and then joined to form the 5.6kb third 
stage assembly intermediates. The assem- 
bly products were amplified by PCR (as 
before), and these three PCR products 
were then assembled to form the complete 
synthetic mouse mitochondrial genome. 
The final assembly reaction with the three 
PCR products also contained a bacterial 
artificial chromosome (BAC), so that the 
finished mitochondrial genome could be 
cloned into E. coli. This case demonstrates 
not only the ability to perform large-scale 
DNA construction almost entirely in vitro,
but also that the process is amenable to
automation. 
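The bookkeeping of this four-stage scheme can be checked with a short calculation. The sketch below assumes that every group at every stage is full (eight oligos per first-stage assembly, then groups of five, five, and three); the function name is hypothetical, and the intermediate count of 75 first-stage assemblies is derived from these numbers rather than stated in the text.

```python
import math


def stage_counts(starting_pieces, group_sizes):
    """Number of pieces present before and after each assembly stage.

    Assumes pieces are pooled into groups of the given size at each
    stage, with any remainder forming one smaller final group.
    """
    counts = [starting_pieces]
    for group_size in group_sizes:
        counts.append(math.ceil(counts[-1] / group_size))
    return counts


# 600 oligos -> groups of 8 -> groups of 5 -> groups of 5 -> final 3-piece assembly
print(stage_counts(600, [8, 5, 5, 3]))  # [600, 75, 15, 3, 1]
```

The result agrees with the 15 second-stage and three third-stage PCR products mentioned above. The same helper would not reproduce schemes in which group sizes vary within a stage, so it is best read as a consistency check for this particular example.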

In 2008, the first synthetic bacterial 
genome was synthesized at the J. Craig 
Venter Institute [40]. The process to assem- 
ble the synthetic 582970bp Mycoplasma 
genitalium genome was hierarchical — sim- 
ilar to the mouse mitochondrial genome 
assembly (although the M. genitalium 
genome is about 35-fold larger). The syn- 
thetic M. genitalium genome was assem- 
bled from 101 synthetic DNA cassettes 
each of about 5-7 kb in length (Fig. 3b).
In this case, the cassettes were synthe- 
sized from oligos by several different gene 
synthesis companies, and verified by se- 
quencing. The cassettes overlapped their 
adjacent neighbors by an average of about 
80 bp. In order to allow the formation of 
the circular genome, cassette 1 overlapped 
cassette 101. 

The main challenge in the synthe- 
sis of the M. genitalium genome was 
the assembly and cloning of synthetic 
DNA molecules larger than those previ- 
ously known. In the first stage, sets of 
four adjacent cassettes were assembled by 
in vitro recombination into a BAC vector to 
form circularized recombinant plasmids 
with about 24kb inserts that were then 
released by restriction enzyme-mediated 
digestion in preparation for the next as- 
sembly stage. The 25 first-stage assem- 
blies were taken three at a time to form 
the 72 kb second-stage assemblies, again 
by in vitro recombination. In the third 
stage, the 72kb second-stage assemblies 
were taken two at a time to produce 
four third-stage assemblies, each of ap- 
proximately one-quarter-genome (144 kb) 
in size. The first three stages of assembly 
were performed by in vitro recombina- 
tion and cloned into E. coli in order to 
generate more DNA for the subsequent 
rounds of assembly. The final stage of
the M. genitalium genome assembly was
carried out in vivo by utilizing the homolo- 
gous recombination activity of S. cerevisiae. 
The last stage of the genome assembly con- 
sisted of six overlapping pieces of DNA 
to generate the complete M. genitalium 
genome (one yeast vector, two fragments 
of quarter 3, and quarters 1, 2, and 4). 
The final step was performed in vivo be- 
cause limitations of the cloning host and 
in vitro assembly reaction became appar- 
ent. It is possible that larger assemblies 
(280-580 kb) are not stable in E. coli, but 
it is also possible that the circularization 
of large DNA molecules may be inefficient 
during the in vitro recombination reaction, 
and/or that handling large DNA molecules 
in solution leads to breakage of the DNA 
before transformation. 

Subsequently, the powerful ability of 
yeast to take up and assemble multiple 
large fragments of DNA was demonstrated 
by taking the 25 first-stage assemblies and 
assembling them in one step by using 
yeast [41]. This proved to be significant 
because it allows for fewer assembly steps, 
and thus greatly reduces the time required 
to construct large DNAs. 

By leveraging the DNA uptake and 
recombination capability of yeast, a 
three-stage hierarchical strategy was 
designed to assemble the 1077947bp 
M. mycoides ssp. capri genome. In this 
case, the assembly steps were performed 
entirely in vivo by transformation and 
homologous recombination in yeast, 
following the initial DNA cassette con- 
structions (Figs 3c and 4) [7]. This differs 
from the strategy used to construct the 
mouse mitochondrial and M. genitalium 
genomes, which used in part an in vitro 
homologous recombination reaction. The 
cassettes designed to assemble into the 
complete genome were generally each of 
1080 bp, with 80bp overlaps to adjacent 
cassettes. As with M. genitalium, the 1078
cassettes (each 1080bp long) were all 
produced commercially by the assembly 
of chemically synthesized oligos. To 
assist in the assembly process, DNA 
cassettes and assembly intermediates 
were designed to recombine in the 
presence of vector elements to allow for 
growth and selection in yeast. During 
the first stage of assembly, groups of 
10 of the 1080bp DNA cassettes and 
a vector were recombined in yeast to 
produce circular subassembly plasmids; 
these were then transferred to E. coli in 
order to easily generate the quantities 
of subassembly DNA required for the 
second-stage assembly step. 

For the second-stage assemblies, 10 of
the 10 kb assemblies were pooled with their
respective cloning vectors and transformed
into yeast to produce 100 kb assembly
intermediates. Circular plasmid DNA was
extracted from yeast in order to proceed 
to the final assembly stage. In the final 
stage, 11 of the second-stage assemblies 
(100 kb each) were pooled, and the yeast 
transformation procedure was repeated 
a final time to produce the circular M. 
mycoides ssp. capri genome. 
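
The arithmetic behind this three-stage design can be sketched in a few lines of Python. The sketch assumes an idealized, uniform tiling (every cassette exactly 1080 bp with an 80 bp overlap, and strict pooling in groups of 10); the function names are illustrative and are not taken from the original work. Under these assumptions the circular genome tiles into 1078 cassettes, which matches the number reported. Strict grouping by tens gives 108 first-stage assemblies rather than the 109 shown in Fig. 4, which suggests that not every group contained exactly 10 cassettes.

```python
import math

GENOME_BP = 1_077_947   # designed M. mycoides ssp. capri sequence
CASSETTE_BP = 1080      # nominal cassette length
OVERLAP_BP = 80         # overlap shared with the adjacent cassette


def cassettes_needed(genome_bp, cassette_bp, overlap_bp):
    """Cassettes needed to tile a circular genome (idealized, uniform tiling)."""
    unique_bp = cassette_bp - overlap_bp  # new sequence contributed per cassette
    return math.ceil(genome_bp / unique_bp)


def stage_counts(n_parts, group_sizes):
    """Number of pieces remaining after each successive pooling stage."""
    counts = [n_parts]
    for size in group_sizes:
        counts.append(math.ceil(counts[-1] / size))
    return counts


n_cassettes = cassettes_needed(GENOME_BP, CASSETTE_BP, OVERLAP_BP)
# Stages 1 and 2 pool 10 pieces each; stage 3 pools all 11 remaining pieces.
print(n_cassettes, stage_counts(n_cassettes, [10, 10, 11]))
```

Each stage reduces the number of pieces roughly tenfold, which is why three rounds of yeast transformation are enough to go from about a thousand cassettes to one genome.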

Recently, various alternatives to us- 
ing yeast to clone and assemble large 
DNAs in vivo have been described. For 
example, Itaya and colleagues also con- 
structed the complete recombinant mouse 
mitochondrion (16.3kb) and rice chloro- 
plast (134.5 kb) genomes from their small 
contiguous DNA pieces. In these cases, 
the starting DNAs for the assembly were
derived via PCR and assembled in B.
subtilis [53]. The latter bacterium has a very
large capacity to take up and assemble
DNA, as demonstrated by cloning of the 
3.5 Mb genome of the photosynthetic bac- 
terium Synechocystis into the B. subtilis 
genome [42, 54]. This was accomplished
by progressively assembling and editing
contiguous DNA regions that cover the
entire Synechocystis genome. It is
important to note that the Synechocystis
genome was not a circular free molecule
(as are the other genomes), but rather was
incorporated as two pieces into the
B. subtilis genome.

Fig. 4 The assembly of a synthetic
M. mycoides ssp. capri genome in yeast.
A synthetic M. mycoides genome was
assembled from 1078 overlapping DNA
cassettes in three steps. In the first step,
1080-bp cassettes (orange arrows), produced
from overlapping synthetic oligonucleotides,
were recombined in sets of 10 to produce
109 approximately 10 kb assemblies (blue
arrows). These were then recombined in
sets of 10 to produce 11 approximately
100 kb assemblies (green arrows). In the
final stage of assembly, these 11 fragments
were recombined into the complete genome
(red circle). With the exception of two
constructs that were enzymatically pieced
together in vitro (white arrows), assemblies
were carried out by in vivo homologous
recombination in yeast. Major variations
from the natural genome are shown as
yellow circles. These include four
watermarked regions (WM1 to WM4), a 4 kb
region that was intentionally deleted (94D),
and elements for growth in yeast and
genome transplantation. In addition, there
are 20 locations with nucleotide
polymorphisms (asterisks). Coordinates of
the genome are relative to the first
nucleotide of the natural M. mycoides ssp.
capri sequence. The designed sequence is
1,077,947 bp.

Several general features have been iden- 
tified that might be desirable in a DNA 
synthesis method. Although, ideally, the 
method should have a low cost per base 
pair of DNA synthesized, the present cost 
of DNA synthesis is dominated by the price 
of the starting oligos. Oligonucleotides ob- 
tained from DNA microarrays can greatly 
reduce the cost of gene synthesis (esti- 
mated to be as much as an order of 
magnitude), because thousands of oligos 
can be synthesized on a single chip on 
a small scale. Gene synthesis from DNA 
microarrays can be hampered by the small 
quantity and complex mixture of the oli- 
gos obtained [55-57]. Regardless of the 
source of oligos, the cost of DNA synthe- 
sis can be greatly impacted by the amount 
of sequencing required to find a correct 
clone, and this leads to the importance 
of accuracy in gene synthesis. Errors in 
gene synthesis are unavoidable because 
the process of creating the starting DNA 
oligos is not perfect, and some fraction 
of the oligos will inevitably contain errors. 
Whilst the starting oligos can be purified 
by using different methods in order to 
minimize the number of error-containing 
oligos, such processes are expensive and
time-consuming, and also are not
compatible with high-throughput gene
synthesis. The sequencing of multiple clones is
sometimes sufficient to identify the cor- 
rect synthesized DNA, depending on the 
efficiency of the assembly method used 
and the size of the DNA construct.
Several error-correction methods have been
developed, however, in attempts to reduce
the amount of sequencing required [36,
58, 59].
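
The dependence of sequencing effort on construct size can be illustrated with a simple probability model. The sketch below assumes that errors occur independently at a fixed rate per base pair; the rate of 1 error per 500 bp is a hypothetical value chosen only for illustration, and the function names are not from any published tool. Under this assumption, the chance that a clone is entirely correct falls exponentially with length, and the mean number of clones that must be sequenced to find one correct clone is the reciprocal of that chance.

```python
def p_error_free(length_bp, error_rate_per_bp):
    """Probability that a clone of the given length contains no errors."""
    return (1.0 - error_rate_per_bp) ** length_bp


def expected_clones_to_sequence(length_bp, error_rate_per_bp):
    """Mean number of clones sequenced before one correct clone is found."""
    return 1.0 / p_error_free(length_bp, error_rate_per_bp)


ERROR_RATE = 1 / 500  # hypothetical: one error per 500 bp of assembled DNA

for length in (500, 1080, 5000):
    p = p_error_free(length, ERROR_RATE)
    n = expected_clones_to_sequence(length, ERROR_RATE)
    print(f"{length:>5} bp: P(correct) = {p:.3f}, expected clones = {n:.1f}")
```

Lowering the effective error rate, which is the aim of the error-correction methods cited above, reduces the required sequencing in the same exponential fashion.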

The goals of Synthetic Biology often re- 
quire DNA to be manipulated from the 
nucleotide to the genome level. Although 
the methods available to generate synthetic 
DNAs from genes to genomes have been 
discussed, in many cases a genome may 
need to be modified by inserting, deleting, 
or replacing a gene or sequence. These 
modifications may be necessary in only a 
few places, either individually or simulta- 
neously in several noncontiguous places. 
Consequently, several DNA manipulation 
techniques have been developed to allow 
these types of change. 

Recombineering uses the activity of 
lambda phage enzymes to catalyze highly 
efficient homologous recombination in 
vivo [60]. The lambda Red system allows an 
easy and rapid modification of the genome 
of a compatible host. Moreover, because
only short homologous overlaps are
required for recombination with the lambda
Red system, the modifying DNA can easily
be generated by PCR, with the homologous
overlaps incorporated into the primer
design. Recombineering can
be used to insert, delete or replace a gene 
or sequence. Moreover, if multiple modi- 
fications are needed in the same organism 
the process can be repeated sequentially, 
although the time required to make more 
than a few changes can become signifi- 
cant. This system is available for E. coli [61, 
62], Pseudomonas [63], and Salmonella [64], 
while a similar system based on a different 
phage has been developed for Mycobac- 
terium tuberculosis and M. smegmatis [65]. 
Based on the studies with M. tuberculosis, 
it is not unreasonable to speculate that 
a similar recombineering system can be 
implemented in many more organisms 
by exploiting their native phages. 
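
The primer design step described above can be sketched as follows. This is a minimal illustration, not a published protocol: the homology arm length of 40 nt and annealing length of 20 nt are assumed defaults, the function names are hypothetical, and real designs would also check melting temperature and secondary structure.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def recombineering_primers(genome, start, end, cassette, arm=40, anneal=20):
    """Design PCR primers that add homology arms to a cassette.

    The amplified cassette is intended to replace genome[start:end].
    Each primer is a homology arm (copied from the genome next to the
    target) followed by a region that anneals to the cassette end.
    """
    if start < arm or end + arm > len(genome):
        raise ValueError("target is too close to the end of the sequence")
    if len(cassette) < anneal:
        raise ValueError("cassette is shorter than the annealing region")
    upstream_arm = genome[start - arm:start]
    downstream_arm = genome[end:end + arm]
    forward = upstream_arm + cassette[:anneal]
    reverse = reverse_complement(cassette[-anneal:] + downstream_arm)
    return forward, reverse


def expected_product(genome, start, end, cassette):
    """Sequence expected after the cassette replaces genome[start:end]."""
    return genome[:start] + cassette + genome[end:]


# Toy example with short arms so the result is easy to inspect.
toy_genome = "ACGTACGTAC" + "TTTT" + "GGCCGGCCGG"
fwd, rev = recombineering_primers(toy_genome, 10, 14, "ATGAAATAG", arm=5, anneal=4)
print(fwd, rev)
```

Setting `start == end` describes an insertion, and an empty target replaced by a marker cassette describes the usual gene knockout, so the same helper covers the insert, delete, and replace cases mentioned in the text.
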

By using the same lambda recombination
proteins, Wang and colleagues
developed the technique of multiplex au- 
tomated genome engineering (MAGE) for 
the large-scale programming and evolu- 
tion of E. coli cells [66]. MAGE employs a 
mixture of single-stranded oligos to simul- 
taneously target many locations on the 
chromosome for modification either in 
a single cell, or across a population of 
cells. Selection markers are not necessary 
because the process is efficient, iterative, 
and cumulative. The highest efficiencies of 
MAGE are observed when small changes 
are being made to the genome (a few 
base pairs), but the efficiency is much 
reduced when larger changes such as in- 
sertions (>20 bp) or deletions (>1000 bp) 
are attempted. The MAGE process is able 
rapidly to produce combinatorial genomic 
diversity. Indeed, the power of MAGE
has been demonstrated by tuning the
translation of 20 endogenous genes and
optimizing the production of lycopene in
E. coli. Warner et al. have combined a
molecular barcode technology with
recombineering to develop trackable
multiplex recombineering (TRMR) [67], such
that thousands of specific genetic
modifications can be produced
simultaneously by recombineering. In this
case, each modification is associated with
a molecular
barcode, and the barcode sequences and 
microarrays can then be used to quan- 
tify the allele frequency in the population; 
this, in turn, allows mapping of the genetic 
modifications that affect a trait of interest. 
This technique may be useful when engi- 
neering a trait for which there is limited 
genetic knowledge. 
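
The barcode-based mapping step can be illustrated with a small calculation. The sketch below is not the TRMR analysis pipeline: the barcode names and counts are invented, and the use of a pseudocount and a log2 ratio of frequencies is simply one common way to compare allele frequencies before and after selection.

```python
import math


def allele_frequencies(counts):
    """Convert raw barcode counts into fractions of the population."""
    total = sum(counts.values())
    return {barcode: n / total for barcode, n in counts.items()}


def log2_enrichment(before, after, pseudocount=1):
    """log2 change in barcode frequency between two population samples."""
    barcodes = set(before) | set(after)
    # The pseudocount avoids division by zero for barcodes that drop out.
    before_adj = {bc: before.get(bc, 0) + pseudocount for bc in barcodes}
    after_adj = {bc: after.get(bc, 0) + pseudocount for bc in barcodes}
    freq_before = allele_frequencies(before_adj)
    freq_after = allele_frequencies(after_adj)
    return {bc: math.log2(freq_after[bc] / freq_before[bc]) for bc in barcodes}


# Hypothetical barcode counts before and after growth under selection.
before = {"bc_geneA_up": 100, "bc_geneB_down": 100, "bc_geneC_up": 100}
after = {"bc_geneA_up": 400, "bc_geneB_down": 25, "bc_geneC_up": 100}

scores = log2_enrichment(before, after)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Barcodes with the highest scores point to modifications that improve the trait under selection, while strongly negative scores point to modifications that impair it.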

A method for the rapid engineering 
of multiple genetic changes in yeast 
was developed by Suzuki and colleagues 
[68]. This method, referred to as “Green
Monster,” employs an inducible green
fluorescent protein (GFP) reporter gene
to create individual deletions in separate
yeast strains. The deletions in the
individual strains are then combined 
by repeated rounds of mating, meiosis, 
and flow cytometry-based enrichment. In 
each sexual cycle, progeny bearing an 
ever-increasing number of altered loci are 
enriched on the basis of gene dosage. 
If it could be adapted to extrachromo- 
somal DNAs, the Green Monster might 
represent a valuable technique for al- 
tering large DNAs or bacterial genomes 
which are cloned in S. cerevisiae. Although 
the method has been demonstrated using 
S. cerevisiae, it should be possible to extend 
the technology to bacteria with a mating 
equivalent such as bacterial conjugation 
(e.g., E. coli). 

In some cases, when the expression 
from a high gene dosage is desired, the use 
of plasmids may not be a feasible option. 
In this case, copies of the gene of interest 
could be inserted into the genome by using 
some of the above-described methods, but 
this may be both labor- and time-intensive
if a high gene dosage is required. In an
attempt to overcome this problem, Tyo and
colleagues used a plasmid-free, high-gene 
copy expression system termed chemically 
inducible chromosomal evolution (CIChE) 
to evolve an E. coli chromosome with about 
40 copies of a recombinant pathway [69]. 
This was achieved by creating a cassette 
that contained the genes of interest, along 
with a gene encoding antibiotic resistance 
for chloramphenicol. When the strain was 
then grown in increasing concentrations 
of chloramphenicol, the selective pressure 
of the increasing antibiotic concentration 
resulted in duplications of the cassette and, 
therefore, also of the antibiotic resistance 
gene by recA-dependent homologous
recombination. When the desired cassette
copy number was reached, recA could
be deleted to prevent any homologous
recombination that could alter the cassette
copy number.

As shown above, many good methods 
have been devised for constructing and as- 
sembling either synthetic or natural DNA. 
However, several factors must be consid- 
ered and balanced when deciding on a 
DNA assembly strategy; these should in- 
clude the cost of the method, the time 
required, the individual’s experience with 
the different host organisms, and whether 
there is a need for vast combinatorial num- 
bers. Fortunately, the different assembly 
methods are diverse and robust, so that a 
large range of requirements can be met. 
In general, it appears that yeast may be 
the preferred host when constructing large 
DNAs, although several of the methods 
available to generate genes or smaller (ca.
1 kb) subassemblies for the construction
of larger DNAs are equally suitable, de- 
pending on the user’s preferences. 


3 
Modular Parts and Circuits 


One core component of Synthetic Biology 
is to apply engineering-based approaches 
of modularization, rationalization, and 
modeling to control cellular behavior, in 
order to obtain desired functions or to 
understand complex biological systems 
[70, 71]. As a result, an increasing number 
of synthetic biologists have begun to apply 
electrical circuit analogies to biological 
pathways as a means of designing and 
generating synthetic genetic devices which 
can then be placed into cells to control 
their behavior [72]. The first examples of
genetic circuits, the toggle switch and the
repressilator, were demonstrated about a
decade ago [73, 74]; since then, there has
been an ever-expanding number of reports
of various types of biological circuit,
including additional genetic switches
[75, 76], other oscillators [77-80] and
memory networks [81, 82], as well as other 
electronic-inspired genetic devices [70] 
such as pulse generators [83], logic gates 
[75, 84], filters [85], and communications 
modules [86, 87]. Today, these devices 
have begun to be used for practical 
applications in biosensing, therapeutics, 
and in the generation of important 
industrial products. As the scope of this 
chapter is very broad, it is only possible 
here to discuss these synthetic genetic 
devices and their uses with limited 
representative examples. A number of 
excellent recent reviews will provide more 
detailed information on these synthetic 
gene networks [70, 71, 88-90]. 
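
As an illustration of how such a circuit can be reasoned about quantitatively, the sketch below simulates a toggle switch using the standard dimensionless model of two mutually repressing genes, du/dt = alpha1/(1 + v^beta) - u and dv/dt = alpha2/(1 + u^gamma) - v. The parameter values, time step, and the simple Euler integration are assumptions chosen for clarity rather than values taken from the cited studies.

```python
def toggle_step(u, v, dt, alpha1=10.0, alpha2=10.0, beta=2.0, gamma=2.0):
    """Advance the two repressor levels by one Euler step."""
    du = alpha1 / (1.0 + v ** beta) - u   # synthesis repressed by v, minus decay
    dv = alpha2 / (1.0 + u ** gamma) - v  # synthesis repressed by u, minus decay
    return u + dt * du, v + dt * dv


def simulate(u0, v0, t_end=50.0, dt=0.01):
    """Integrate from (u0, v0) and return the final repressor levels."""
    u, v = u0, v0
    for _ in range(round(t_end / dt)):
        u, v = toggle_step(u, v, dt)
    return u, v


# The same circuit settles into opposite states from different starting points.
print(simulate(5.0, 1.0))  # starts with more of repressor u
print(simulate(1.0, 5.0))  # starts with more of repressor v
```

The two runs end in different stable states although the equations are identical, which is the memory-like behavior that makes the toggle switch useful as a genetic device.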

The basic design of circuits is the assem- 
bly of various modular component parts 
that respond to an input signal that is then 
relayed to produce an output signal. For 
biological circuits, the component parts 
have their origins in the vast amount of 
basic research in all aspects of the biologi- 
cal functions of organisms. The combined 
effort of many research groups has led to 
great understanding of the various path- 
ways that organisms use to respond to 
environmental signals, such as light or 
quorum sensing. Accordingly, many of the 
components of these pathways — such as 
signaling proteins and transcription fac- 
tors, as well as the promoters they control 
— have been identified from a wide va- 
riety of organisms [71, 89, 91]. In order 
to create some biological devices, syn- 
thetic biologists have taken components 
from one organism and placed them in 
a different model organism. For example,
Danino et al. constructed synchronized
genetic oscillators by placing elements
of the quorum sensing machineries of
Vibrio fischeri and Bacillus thuringiensis
into the model organism E. coli [77]. Given
the differences in GC (guanine-cytosine)
content, codon usage, and transcription
and translation processes between the
various organisms, it was not initially clear
whether devices created from component
parts derived from different organisms
would function as intended. Several
methods have been adopted to overcome
this difficulty.

First, web applications such as 
GeneDesign [92] have been developed 
that enable synthetic biologists to design 
the component parts according to the 
specifications of the host organism 
(see Table 1). In addition, as discussed 
above, the recent advances in gene 
synthesis, as well as the decrease in the 
cost of oligos and the proliferation of 
gene synthesis companies, have allowed 
synthetic biologists simply to purchase 
the component parts. 

Second, Knight, Rettberg, Endy and oth- 
ers have founded the Registry of Standard 
Parts (Table 1) to provide a framework for 
synthetic biologists to devise biological cir- 
cuits, pathways, and other genetically en- 
coded systems. The idea here is to emulate 
the engineering principles involved in the 
construction of such things as electronic 
devices [93]. The Registry of Standard Parts 
provides a catalog of a wide variety of 
biological parts, such as transcription pro- 
moters and terminators, ribosome binding 
sites and regulatory proteins, as well as a 
variety of chassis in which the parts can 
function. This facilitates the ability of the 
synthetic biologists to tailor their biolog- 
ical devices by allowing them to choose 
from a variety of parts whilst, at the same 
time, providing a wealth of information on 
how these parts can function. The Registry 
also allows for interfacing with other web 
applications such as modeling tools. As
an example, users can input information
from the Registry into SynBioSS Designer
to generate kinetic models for the selected
biological constructs, and provide a picture
of how these constructs would influence
the behavior of the whole system [94].

With these advances in place, synthetic 
biologists are today beginning to use DNA 
assembly techniques to piece together 
modular parts into controlled biological 
pathways as well as biological circuits in 
a wide spectrum of applications, such as 
biosensing, the production of therapeu- 
tics and biofuels, as well as understanding 
complex cellular behavior and even ecosys- 
tems. For example, synthetic biosensors 
can be used to detect various environ- 
mental signals and then to prompt cells 
to enter a programmed behavior [70]. In 
one study, Kobayashi et al. generated a 
genetic device that could detect DNA dam- 
age and, through a designed activation of 
the SOS pathway, program E. coli cells to 
enter a biofilm state [95]. Designed
bacteria have also been shown to produce
invasin from Yersinia pseudotuberculosis
upon detecting the hypoxic environment
of tumor cells, which in turn allowed the
bacteria to invade the tumor cells [96]. In
a particularly innovative experiment,
Looger et al. computationally redesigned
protein-ligand specificities to construct
receptors that could bind trinitrotoluene
or L-lactate. These receptors were then
incorporated into synthetic bacterial signal
transduction pathways to regulate gene
expression in response to extracellular
trinitrotoluene or L-lactate [97]. The
results of these studies support the
promise that genetic biosensors of this
type may be useful for detecting a desired
environment or signal, and for allowing
the host organism to respond in a targeted
manner.

Tab. 1 Resources for synthetic biology.

Resource: Synthetic Biology Resources
URL: http://www.istl.org/10-spring/internet1.html
Comments: Provides links to various resources on the internet, including synthetic biology associations, centers of research, ethics, training and educational resources, and journals

Resource: Synthetic Biology.net
URL: http://www.syntheticbiology.net/index.aspx
Comments: Portal for professionals in synthetic biology providing information on news, events, products, suppliers, etc., regarding synthetic biology

Resource: Synthetic Biology Project
URL: http://www.synbioproject.org/
Comments: Established as an initiative of the Foresight and Governance Program of Woodrow Wilson International Center for Scholars to foster informed public and policy discourse concerning the advancement of Synthetic Biology

Resource: BioBricks Foundation
URL: http://bbf.openwetware.org/
Comments: Encourages the development and responsible use of technologies based on BioBrick™ standard DNA parts to allow synthetic biologists to program living organisms in the same way a computer scientist can program a computer (see below)

Resource: Registry of Standard Biological Parts
URL: http://partsregistry.org/Main_Page
Comments: A collection of genetic parts that can be mixed and matched to build synthetic biology devices and systems

Resource: SynBERC Synthetic Biology Engineering Research Center
URL: http://www.synberc.org/
Comments: Mission is to develop technologies to build biological components and assemble them into integrated systems to perform designed tasks, train engineers for biology, and educate the public on Synthetic Biology

Resource: BIOFAB
URL: http://www.biofab.org/
Comments: Biological design-build facility that aims to produce useful collections of standard biological parts available to academic and commercial users

Resource: JBEI Registry
URL: https://public.jbeir.org/
Comments: Also aims to provide standard DNA parts for Synthetic Biology

Resource: GeneDesign
URL: http://www.genedesign.org/
Comments: Set of web applications that provides public access to a nucleotide manipulation pipeline for Synthetic Biology

Resource: SynBioSS Designer
URL: http://synbioss.sourceforge.net/
Comments: Software suite for the generation, storage, retrieval, and quantitative simulation of synthetic biological networks

Biological circuits have also shown their
possible value in the antibacterial
therapeutic field. For example, bacteriophages
were engineered to suppress the SOS
pathway of bacteria and enhance the 
killing of antibiotic-resistant bacteria, 
“persister cells,” and biofilm cells [98]. 
Another area where the ability to easily 
design and place modular parts into a 
designed biological pathway is proving 
useful is in the control of metabolic flux 
for the production of industrially impor- 
tant materials or chemicals. Recently, the 
Stephanopoulos group has reported an 
ability to increase the titer of taxadiene 
(an intermediate of the potent anticancer 
drug, Taxol) in an engineered E. coli strain 
[99]. This effect was accomplished by
first partitioning the taxadiene metabolic
pathway into two modules, a native
upstream methylerythritol-phosphate
(MEP) pathway forming isopentenyl
pyrophosphate and a yew tree-based
heterologous downstream terpenoid-forming
pathway, and then varying the expression
of the two modules simultaneously to
obtain an improved, balanced pathway.

Genetic devices are also proving their 
worth in helping to understand the un- 
derlying basic principles of coordinated 
complex cell behavior. As examples, two 
relatively recent studies have highlighted 
efforts to emulate pattern formation that is 
important for development in higher eu- 
karyotes [85, 100]. The latter study used 
an in vitro approach with DNA-coated 
paramagnetic beads fixed by magnets 
in an artificial chamber to form artificial
transcription-translation networks
that generate simple patterns [100]. In the
former study, Basu et al. engineered two 
genetically distinct populations of bacte- 
ria (acyl-homoserine lactone senders and 
receivers) and manually overlaid them in 
different configurations to produce
different patterns [85]. More recently, instead
of using two distinct bacterial
populations, Tabor et al. genetically
engineered an isogenic community of
E. coli cells to sense light, communicate to
identify light-dark edges, and produce an
image [101]. Similar
biological circuits to those discussed above 
have also been utilized to construct syn- 
thetic ecosystems or biofilms to model and 
better understand microbial communities 
[10-13]. While relatively simple genetic 
devices have been used thus far, it is ex- 
pected that a thorough characterization 
of their performance, together with im- 
proved predictive mathematical tools, will 
allow for the design and construction of 
more elaborate circuits to program cells 
and cellular communities for functions 
that mimic those of natural systems. 


4 
Spatial Regulation 


Many of the parts and circuits detailed 
above involve the regulation of genes or 
gene products. Most of the time, when
gene regulation is referred to, it is assumed
to be a temporal regulation, that is, the
increase or decrease of expression of a
gene that was, at a previous time, in the
other state. It is, however, also possible
to regulate genes and gene products in 
space. The ability to compartmentalize or 
localize enzymatic activities has long been 
recognized as a powerful means of opti- 
mizing catalysis. In Nature, this can take 
the form of macromolecular complexes 
that actively channel substrates from one 
enzymatic active site of a biosynthetic path- 
way to the next [102-105], organelles and 
compartments that can physically sepa- 
rate certain enzymes and substrates from 
the rest of the cellular processes [106, 
107], or it may simply be the result of 
reduced diffusion due to co-localization 
[108, 109]. Just as compartmentaliza- 
tion and co-localization can take many 
forms, so too they offer many advantages. 


These include the mitigation of toxicity of 
intermediates of a biosynthetic pathway, 
the protection of intermediates from diffu- 
sion or degradation, the elimination of un- 
productive side reactions as a consequence 
of the other biological activities of the host 
organism, or improving activity by driving 
kinetics (for an excellent review of both 
natural and engineered systems of spatial 
control over cellular processes, see Ref. 
[108]). In the past, organic chemists have 
been inspired by these natural systems and 
have created a wide variety of biomimetic 
catalysts that incorporate such features 
as cyclodextrin covalent linkers that allow 
many enzyme molecules to be tethered to 
one another, leading to improved kinetics 
[110]. However, whilst inspired by biol- 
ogy, and often using components derived 
from living organisms, such systems are 
not wholly biological and rarely operate 
in the biological context from which their 
component enzymes are derived. Conse- 
quently, such cell-free systems, which lack 
the self-replication capability of living cells, 
often suffer from problems of enzyme sta- 
bility and purification. An excellent review 
of biochemical constructs that incorporate 
compartmentalization and co-localization 
based around such diverse methods as co- 
valent and noncovalent linkers, of micelles 
made from both synthetic polymers and 
lipids, and of vesicles and viral particles 
used as nanoreactors, has been prepared 
by Vriezema et al. [111]. 

Here, examples of engineered, cellular 
compartmentalization strategies will be 
discussed, with emphasis placed on the de- 
sign principles that they embody and use. 


4.1 
Co-Localization 


Dueber et al. were able to use the synthetic
biology principles of modular parts and
co-localization to create a heterologous
synthetic protein complex in E. coli of three
enzymes: acetoacetyl-CoA thiolase
(AtoB); hydroxymethylglutaryl-CoA synthase
(HMGS); and hydroxymethylglutaryl-CoA
reductase (HMGR) [3]. These three
genes, derived from yeast, compose a
pathway that produces mevalonate from 
acetyl-CoA. Mevalonate is a precursor 
for the production of chemicals in the 
industrially and medically valuable large 
isoprenoid family [112]. However, due 
to very different levels of activity, these 
enzymes — even if expressed at optimal 
levels — result in a build-up of the toxic 
intermediate hydroxymethylglutaryl-CoA 
(HMG-CoA). To overcome this, Dueber 
et al. organized the three-enzyme pathway 
into a synthetic complex of enzymes on a 
“scaffold” by using the well-characterized
signal processing protein-protein
interaction domains of metazoan cells
[mouse SRC Homology 3 (SH3) and Post
Synaptic Density protein, Drosophila Disc
Large Tumor Suppressor, and Zonula
Occludens-1 protein (PDZ) domains, and
the rat GTPase protein Binding Domain
(GBD)] with their corresponding
ligands. Each ligand is a small tag-like
sequence that can be added to either end 
of a protein, and which binds specifically 
to its corresponding domain. The SH3, 
PDZ and GBD domains were then 
expressed as a fusion protein that would 
recruit the three ligand-tagged enzymes 
of the mevalonate pathway into a single
complex or “scaffold.” The main aspects
of the scaffold optimization and design 
involved the order of the binding domains 
(and, consequently, of the pathway 
enzymes with their ligands), and whether 
the ligand sequence was fused to the N 
or C terminus of each enzyme. Another 
aspect to be optimized was the number
of copies of each enzyme recruited to the
scaffold. For example, a scaffold might be
designed to have three PDZ domains, but
only one SH3 and one GBD domain; this 
caused three copies of one of the pathway 
enzymes to be recruited to the complex, 
but only one each of the other enzymes. 
When these optimizations were made, the 
result was a dramatic 77-fold increase in 
the yield of mevalonate due to a reduction 
in the metabolic load on the cell and 
toxicity associated with HMG-CoA build 
up. The general nature of the method 
was also proved by its application to the 
pathway for glucaric acid synthesis [3]. 
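
The design variables listed above (domain order, ligand terminus, and copy number) define a combinatorial design space that can be enumerated directly. The sketch below is a simplified illustration rather than the procedure used in the cited work: it assumes that identical domains are grouped into contiguous blocks, that each domain type is present in one to three copies, and that each enzyme carries its ligand at either the N or C terminus. All names are hypothetical.

```python
from itertools import permutations, product

DOMAINS = ("GBD", "SH3", "PDZ")


def scaffold_designs(max_copies=3):
    """Yield (architecture, ligand_termini) pairs for candidate scaffolds.

    architecture   -- tuple of (domain, copy_number) in scaffold order
    ligand_termini -- dict mapping each domain to the enzyme terminus
                      ("N" or "C") that carries the matching ligand
    """
    for order in permutations(DOMAINS):
        for copies in product(range(1, max_copies + 1), repeat=len(DOMAINS)):
            architecture = tuple(zip(order, copies))
            for termini in product("NC", repeat=len(DOMAINS)):
                yield architecture, dict(zip(order, termini))


def describe(architecture):
    """Compact label such as 'GBDx1-SH3x1-PDZx3'."""
    return "-".join(f"{domain}x{n}" for domain, n in architecture)


designs = list(scaffold_designs())
print(len(designs))
print(describe(designs[0][0]))
```

Even with these restrictions the space contains more than a thousand candidates (6 orders, 27 copy-number combinations, and 8 terminus choices), which indicates why the optimization of scaffold architecture is a substantial part of such a project.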

The principle of synthetic complexes and 
co-localization can be used to improve not 
only synthesis pathways but also degrada- 
tive pathways. For example, the fungal 
cellulase Cel6A has been docked onto 
a bacterial mini-cellulosome to achieve 
a greater cellulose degradation and, as 
with the mevalonate biosynthetic path- 
way discussed above, issues of geometry 
and organization of the final synthetic 
scaffold strongly affected the efficacy of 
the result [113, 114]. As the technology 
of scaffold design improves, the develop- 
ment may begin of synthetic pathways 
that resemble tryptophan synthesis [102], 
polyketide synthesis [104] or carbamoyl 
phosphate synthetase [105], where the 
enzymes are not merely co-located but 
actually channel the substrate down a 
tunnel. 


4.2 
Compartmentalization 


Instead of recruiting enzymes to a scaf- 
fold and achieving a degree of control over 
the diffusion of substrates from enzyme 
to enzyme of a pathway, it might be 
advantageous to create a separate cellular
compartment to physically encapsulate
specific enzymes and substrates. An
impressive example of this has been seen
in the investigations of Parsons et al. [5],
who characterized the genes of a natural 
bacterial organelle, which they called a bac- 
terial microcompartment (BMC). BMCs 
are polyhedral protein shells that are asso- 
ciated with specific biosynthetic pathways. 
The most well-studied BMC is the car- 
boxysome, which is associated with the 
fixation of carbon in Cyanobacteria [115]. 
Parsons et al. focused their attention on 
the pdu operon which contains a num- 
ber of genes, some of which are im- 
plicated in the construction of a BMC, 
while others are associated with the
pathway of Salmonella enterica serovar
Typhimurium LT2 for the conversion of
1,2-propanediol into propionaldehyde,
1-propanol, and
propionyl-CoA. These enzymes, and the
reactions they catalyze, are localized within 
the BMC that the pdu operon encodes 
[106]. Based initially on the sequence sim- 
ilarity to carboxysome proteins, and later 
on the results of experiments where cer- 
tain proteins were subtracted, Parsons and 
coworkers identified a set of five genes 
(PduA, PduB, PduJ, PduK, and PduN) 
which produced six proteins (PduB pro- 
duces two versions of the protein) that are 
necessary and sufficient to produce empty 
BMC compartments in E. coli, similar to 
those produced by the native pdu operon 
in S. enterica serovar Typhimurium LT2.
Although not necessary, PduU was shown
to help regulate the size of the BMC to the
approximately 100 nm dimensions it has
natively. If the full set of required genes
for BMC construction was not present,
then large intracellular structures such
as sheets, filaments, or hexagonal lattices
were observed. To prevent these structures
from forming, the wild-type order and
orientation of the remaining Pdu genes
needed to be maintained. Lastly, by using
the N-terminal region of a gene in the Pdu
operon, PduV, it was possible to create a
tag that localized GFP into the BMC.

Although Parsons et al. did not
demonstrate any enzymatic activity inside
the resulting BMC, such activity has been
achieved inside a broadly similar
compartment prepared from a cowpea
chlorotic mottle virus capsid, albeit with
the very simple single-enzyme system of
horseradish peroxidase [116]. It has also
been shown to be possible to recruit
compartments that are already present in
a cell, such as the periplasmic space [117].

In keeping with the modular approach 
central to most Synthetic Biology methods, 
Parsons et al. and Dueber et al. devised
systems by which tags could be fused to
enzymes, allowing for their localization
and thus their spatial regulation.


5 
The Synthetic Cell 


The organisms that have been engineered 
thus far for industrial or other purposes 
have had the advantage of being eas- 
ily manipulated genetically, because of 
high transformation efficiencies and good 
recombination activities. Unfortunately, 
many industrially relevant organisms do 
not possess these characteristics and
therefore cannot easily be engineered; for these
and other intractable organisms, novel 
engineering methods are necessary. One 
approach, as adopted by the research team 
at the J. Craig Venter Institute, has been to 
build a minimal cell that contains only es- 
sential genes, the functions of which have 
been characterized, in an attempt to un- 
derstand the basic principles of life. Such 
a cell may also provide a base into which 
various pathways can be placed to synthe- 
size industrially important products, but 
in a more energy efficient manner than 
is possible with the presently engineered
organisms. 

During their efforts to build a minimal 
cell, the J. Craig Venter Institute group re- 
cently reported the creation of a bacterial 
cell that could be controlled by a chemi- 
cally synthesized genome [7]. The research 
group described in detail the design, syn- 
thesis, and assembly of the 1.08 mega-base 
pair M. mycoides JCVI-syn1.0 genome (see 
Fig. 4), starting from digitized genome 
sequence information, and its transplan- 
tation into a recipient cell to create new 
M. mycoides cells that are controlled only by 
the synthetic chromosome. To distinguish 
the synthetic genome from the natural 
genome, the researchers placed “water- 
mark” sequences and included other de- 
signed gene deletions and polymorphisms 
in the synthetic genome (see Fig. 4). Even 
though the cytoplasm of the recipient cell 
is not synthetic, the research team referred 
to the cells produced after the transplantation
process as “synthetic cells,” because
they are controlled solely by a genome that
was assembled from chemically synthesized
pieces of DNA. These synthetic cells
have the expected phenotypic properties, although
the JCVI-syn1.0 transplants grew
slightly faster than a control strain. 
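The role of the watermark sequences can be illustrated with a minimal sketch: a genome string is searched on both strands for known marker sequences, so a synthetic genome can be told apart from the natural one. The watermark sequences, the toy genomes, and the function names below are all hypothetical and were invented for this illustration; they are not the watermarks used in JCVI-syn1.0.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_watermarks(genome: str, watermarks: dict) -> dict:
    """Map each watermark name to the sorted start positions of its matches.

    Both strands are covered by also searching for the reverse complement.
    """
    genome = genome.upper()
    hits = {}
    for name, mark in watermarks.items():
        positions = set()
        for probe in (mark.upper(), reverse_complement(mark)):
            start = genome.find(probe)
            while start != -1:
                positions.add(start)
                start = genome.find(probe, start + 1)
        hits[name] = sorted(positions)
    return hits


# Hypothetical watermark sequences, invented for this illustration only.
WATERMARKS = {"WM1": "GATTACAGATTACA", "WM2": "CCGGTTAACCGGAT"}

# Toy "natural" genome, and a "synthetic" version carrying both watermarks
# (the second one on the opposite strand).
natural = "ATGC" * 8
synthetic = (natural[:10] + WATERMARKS["WM1"] + natural[10:]
             + reverse_complement(WATERMARKS["WM2"]))

print(find_watermarks(natural, WATERMARKS))
print(find_watermarks(synthetic, WATERMARKS))
```

An exact string search is the simplest possible choice; it would miss a watermark that has acquired a mutation, for which an alignment-based comparison would be needed.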

These studies, using several Mycoplasma 
species, were the culmination of extensive 
efforts over a number of years that led 
to the development of several novel tech- 
nologies (as summarized in Fig. 5). First, 
as discussed above, the team developed a 
strategy of assembling viral-sized pieces to 
produce large DNA molecules that allowed 
them to assemble bacterial genomes in 
S. cerevisiae [40, 41, 43]. Second, the team 
established additional methods to clone 
whole bacterial genomes as centromeric 
plasmids in yeast [118]. Third, they devel- 
oped methods to transplant the genome 
of one bacterial species, M. mycoides ssp. 
capri, into a different recipient bacterial cell
species, M. capricolum ssp. capricolum, to
obtain cells of the donor M. mycoides ssp.
capri [119]. These studies were extended
when the group was able to transplant
the genome of M. mycoides ssp. capri,
which this time was isolated from yeast
as a centromeric plasmid, into recipient
M. capricolum ssp. capricolum cells and
produced viable M. mycoides ssp. capri cells
[120]. The team also genetically altered the
M. mycoides ssp. capri genome in yeast by
using the host’s powerful genetic tools and
newly developed tools [121] to produce a
new strain of M. mycoides ssp. capri that
would not have been possible with the
tools currently available for these bacterial
species.

Fig. 5 Moving a bacterial genome into yeast,
engineering the genome, and its re-installation
into a bacterium, by genome transplantation.
A yeast vector is inserted into a bacterial
genome by transformation, and the genome is
then cloned into yeast. Alternatively, a bacterial
genome is cloned by transforming overlapping
DNA fragments (natural or synthetic) together
with a yeast vector into yeast and allowing the
host’s homologous recombination system to
assemble an intact genome. After cloning, the
repertoire of yeast genetic methods is used to
create insertions, deletions, rearrangements or
any combination of modifications in the bacterial
genome. This engineered genome is then
isolated and transplanted into a recipient cell
to generate an engineered bacterium. Prior to
transplantation, it may be necessary to methylate
the donor DNA in order to protect it from
the recipient cell’s restriction system(s). This
cycle can be repeated starting from the newly
engineered genome (dashed arrow).


Taken together, the Venter Institute 
research team has developed a series of 
technologies that enabled them to clone 
whole bacterial genomes, whether from 
natural sources or synthetic pieces, to 
manipulate them, and to transplant them 
to produce viable bacterial cells (see Fig. 5). 
It will be interesting to see in the future 
whether this technology can be utilized for 
other intractable organisms. 
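The principle behind assembling a genome from overlapping pieces can be sketched in a few lines: each fragment shares a short stretch of sequence with its neighbor, and the pieces are joined wherever the end of one matches the start of another. This is only an abstract model of the logic, assuming a linear target and a fixed overlap length; it does not describe the yeast recombination machinery itself, and the sequence, fragment sizes, and function names are hypothetical.

```python
def split_with_overlap(seq: str, size: int, overlap: int) -> list:
    """Cut seq into pieces of at most `size` bases that share `overlap` bases."""
    step = size - overlap
    return [seq[i:i + size] for i in range(0, len(seq) - overlap, step)]


def assemble(fragments: list, overlap: int) -> str:
    """Join fragments (in any order) by matching end/start overlaps exactly."""
    tails = {frag[-overlap:] for frag in fragments}
    # The leftmost fragment is the one whose start matches no fragment's end.
    starts = [frag for frag in fragments if frag[:overlap] not in tails]
    if len(starts) != 1:
        raise ValueError("expected exactly one fragment with a free left end")
    contig = starts[0]
    remaining = list(fragments)
    remaining.remove(contig)
    while remaining:
        tail = contig[-overlap:]
        matches = [frag for frag in remaining if frag.startswith(tail)]
        if len(matches) != 1:
            raise ValueError("missing or ambiguous overlap")
        contig += matches[0][overlap:]   # append only the non-overlapping part
        remaining.remove(matches[0])
    return contig


# Toy target sequence, invented for this illustration only.
target = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
pieces = split_with_overlap(target, size=15, overlap=5)
shuffled = list(reversed(pieces))        # the input order does not matter
print(assemble(shuffled, overlap=5) == target)
```

The sketch raises an error when an overlap is missing or shared by more than one fragment, which mirrors the practical requirement that the overlapping ends be unique within the set of pieces.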


6 
Societal Challenges Posed by Synthetic 
Biology 


Along with the potential of significant 
benefit, all new technologies raise societal 
concerns. With respect to biotechnology 
in general, these can be described as 
concerns about bioterrorism, laboratory 
safety, harm to the environment, the 
distribution of benefits, and ethical and 
religious concerns [122-125]. 

Synthetic Biology itself — both at the level 
of research and in the application of such 
research to the development of new prod- 
ucts — also raises a variety of societal con- 
cerns, some of which are identical to those 
raised by all biotechnology, though some 
may be unique. These new or unique con- 
cerns may be especially important for the 
governance of the new technology. To its 
credit, the community of research workers 
that identify as synthetic biologists recog- 
nized at a very early stage that these societal 
concerns were both real and legitimate. 
A good number of the synthetic biologists 
have worked with a variety of policymaking 
and social science research communities, 
to ensure that the studies being carried out 
would be performed in a safe and ethical 
manner. Given the extensive interaction
among these various communities,
concerns that are unique (or that have
been recognized as being of great importance,
even if not unique) have been well analyzed;
although the policy problems have
in no way been solved, the challenges have
been very well articulated.

The first sets of societal concerns that 
were dealt with in detail by these com- 
munities were those of biosecurity and 
biosafety. There is a constellation of ethi- 
cal issues (discussed below) that are of no 
less importance than security and safety. 
However, it was clear very early on that 
if Synthetic Biology were to result in haz- 
ards that could not be mitigated for the 
research teams or for society as a whole, 
then there would need to be a morato- 
rium on such studies. Consequently, these 
societal concerns regarding security and 
safety were analyzed first by a variety 
of policy researchers. Hence, the safety 
and security analyses remain, for now, 
somewhat more advanced than the ethics 
analyses. 

In particular, as many of the recent Synthetic
Biology studies have been conducted
in a post-“9/11” environment, concerns
relating to biosecurity were at the forefront
of virtually all policy analyses. Specifically,
the ability to synthesize genomes means 
that, at least in some cases, access to 
pathogens can no longer be physically lim- 
ited as long as the sequences are publicly 
available. For now, the concerns are about 
increasing the ease with which viruses, 
such as 1918 influenza, smallpox and 
Ebola, can be obtained [125]. Additionally, 
Synthetic Biology may eventually provide a 
relatively straightforward way to construct 
pathogens with increased virulence, by al- 
lowing those with nefarious intent to add 
a variety of pathogenesis factors directly to 
a viral genome or bacterial chromosome. 

In order to deal with these potential mali- 
cious applications, both the United States 
Government [126] and a consortium of 
companies providing synthetic DNA [127]
have released guidelines for the screening 
both of synthetic DNA orders, and of the 
customers placing these orders. Questions 
as to whether these guidelines should be 
legally binding, the full range of what the 
companies should be screening for, and 
whether orders for smaller pieces of DNA 
(oligonucleotides) should also be screened 
are all currently under discussion. 
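The idea of screening a synthetic DNA order can be illustrated with a deliberately simplified sketch that compares the short subsequences (k-mers) of an order against a list of reference sequences and flags the order when a large fraction is shared. This is a toy model under stated assumptions: the reference sequence is invented, the k-mer length and threshold are arbitrary, only one strand is checked, and the function names are hypothetical. It is not the procedure laid out in the guidelines cited above, and practical screening would rely on curated databases and sequence-alignment tools.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of all length-k substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def screen_order(order: str, concern_db: dict, k: int = 12,
                 threshold: float = 0.5) -> list:
    """Return (name, shared_fraction) for each reference the order resembles.

    shared_fraction is the share of the reference's k-mers found in the order.
    """
    order_kmers = kmers(order, k)
    flagged = []
    for name, reference in concern_db.items():
        reference_kmers = kmers(reference, k)
        if not reference_kmers:
            continue                      # reference shorter than k
        shared = len(order_kmers & reference_kmers) / len(reference_kmers)
        if shared >= threshold:
            flagged.append((name, round(shared, 2)))
    return flagged


# Made-up reference sequence standing in for a "sequence of concern".
CONCERN_DB = {"toy_A": "ACGTACGTTAGCTAGGATCCATCGATCGGA"}

embedded_order = "GGGG" + CONCERN_DB["toy_A"] + "CCCC"
unrelated_order = "T" * 40

print(screen_order(embedded_order, CONCERN_DB))
print(screen_order(unrelated_order, CONCERN_DB))
```

The sketch also makes one of the open questions concrete: an order shorter than `k`, such as a single oligonucleotide, yields no k-mers at all and can never be flagged by this kind of comparison.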

Current biosafety or laboratory safety
concerns are mostly focused on the speed
and scale that Synthetic Biology brings
to research, and on workers in the field
who have not been trained as
microbiologists. At institutes
with a formal approach to dealing with 
biosafety — such as universities, research 
foundations and scientific companies — in- 
stitutional biosafety committees and other 
institutional groups will likely be taking up 
questions in Synthetic Biology research in 
the same way that they do all other bi- 
ological/biotechnological research. In an 
earlier report [125], policy researchers at 
the JCVI (Michele Garfinkel and Robert 
Friedman), MIT (Drew Endy), and the 
Center for Strategic and International 
Studies (Gerald Epstein) have described 
several options for such bodies for edu- 
cating themselves, and their institutional 
research teams, about what steps must be 
taken to ensure that the research is safe. 
A more generic set of concerns about the 
research teams, and how to mitigate any 
possible biosafety dangers, was discussed 
even earlier in a report from the National 
Academy of Sciences [128]. Yet another set
of people interested in synthetic biology
has also raised concern, namely the
“do-it-yourself” (DIY) community [129]. Although,
for the moment, little is being done by the
DIY community that is clearly “synthetic
biology,” there has been much discussion
within the group about eventually employing
those technologies. In anticipation of this
possibility, the Presidential Commission 
for the Study of Bioethical Issues, which 
is in the process of completing a report 
on and recommendations for synthetic bi- 
ology, has addressed the issues of DIY 
specifically [130]. 

Although the safety and security con- 
cerns may not be “solved,” it appears at 
least that the great majority of issues have 
been laid out, and at least for the moment 
there seem to be no risks that would lead to 
the conclusion that the research should be 
banned, or even severely restricted. (This 
also appears to be the conclusion of the 
Presidential Bioethical Commission, al- 
though its current recommendations are 
only in draft form.) Thus, the policy- and 
social science research communities have 
turned at least some of their attention to 
other, broader societal challenges brought 
about by the potential use of Synthetic 
Biology technologies. 

Concerns about harm to the environ- 
ment from accidental or planned releases 
of engineered microbes date to discus- 
sions at the Asilomar meetings during the 
mid-1970s. The two critical concerns are 
that an engineered microbe will grow out 
of control if released accidentally or as part 
of a planned release, and that DNA from an
engineered organism may be transferred
to a related organism. These concerns have
been dealt with over time via guidance and 
regulation dealing with the containment of 
genetically modified organisms and rules 
for testing these organisms for release into 
the environment. Several US Government 
agencies are currently reviewing several 
sets of regulations and guidance to un- 
derstand whether they are sufficient to 
deal with the use of many new microbes 
in open environments. For example, the 
NIH Guidelines for working with recombinant
DNA are currently being reviewed
to ensure that guidance which was written
to deal specifically with recombinant DNA
applies equally to synthetic DNA [131].

In addition to the potential harm to peo- 
ple or to the environment (as discussed 
above), there are societal challenges be- 
yond the physical. The distribution of 
benefits and risks is a very long-standing 
concern, and one which surfaces for virtually
every new technology. Ownership
as defined by intellectual property rights,
the concentration of knowledge and resources
in a small number of firms or
institutes, and how and whether these
resources should be shared are issues
that have become particularly acute
around research groups (both academic 
and commercial) who wish to develop 
products. Interestingly, the Synthetic Biol- 
ogy community, in addition to civil society 
organizations, has placed this issue at the 
forefront of many of its own discussions 
[132]. These discussions could well lead 
to a better understanding of distribution 
concerns and possible solutions generally. 

Finally, hubris (sometimes called
“playing God”) might be the major
nonphysical concern in Synthetic Biology,
even if it is not fully unique in this
case. In brief, concerns about hubris are
focused on a key issue: Are there ac- 
tions that human beings simply should 
not take? In the case of constructing a 
synthetic cell, these questions arise for 
many communities, from religious tradi- 
tions to policymakers. Is constructing a 
synthetic cell creating life? If it is, is it 
hubris? If it is not creating life, then what 
would define creating life? In either case, 
is this hubris? And how might conduct- 
ing such experiments as constructing a 
synthetic cell change how human beings 
think of themselves, both individually and 
with respect to other organisms and the 
environment in general? There are long
and thoughtful writings (both fiction and
nonfiction, and both recent and deep in
history) about hubristic pursuits that will
not be reviewed here. However, it should
be noted that these questions are being 
studied in detail with respect to Synthetic 
Biology by philosophers, ethicists and the- 
ologians [133]. 


7 
Concluding Remarks 


Today, the capability is available to cre- 
ate any arbitrary DNA sequence, and to 
express that sequence in a wide variety 
of living systems. By using modular parts 
with standardized structures, this DNA 
assembly capability has allowed the en- 
gineering of pathways, organelles, organ- 
isms, tissues, and even ecosystems. Unlike 
previous genetic engineering methods and 
selective breeding methods, these engi- 
neering projects have the potential to be 
cheap, fast, and easy. As these capabilities 
are further refined and extended, an era 
can be anticipated in which biological en- 
gineering will impact every industry and 
activity of humankind. 


Acknowledgments 


The authors thank Dr. Carole Lartigue and 
Dr. Chuck Merryman, and also the Syn- 
thetic Biology members of the J. Craig 
Venter Institute (JCVI) for their stimulat- 
ing discussions. Apologies are offered to 
any Synthetic Biology colleagues whose 
work has not been cited in this chapter. 
Funding for the research conducted at the 
JCVI was provided by Synthetic Genomics, 
Inc. and by the Office of Science (BER), 
U.S. Department of Energy, Grant No. 
DE-FC02-02ER63453. 


Synthetic Biology: Implications and Uses 


References 


1 Mansy, S.S., Szostak, J.W. (2009) Reconstructing the emergence of
cellular life through the synthesis of model protocells. Cold Spring
Harbor Symp. Quant. Biol., 74, 47-54.

2 Ray, T.S. (1992) An Approach to the Synthesis of Life, in: Langton,
C.G. (Ed.) Artificial Life II, Addison-Wesley Publishing Company,
Inc., pp. 371-408.

3 Dueber, J.E., Wu, G.C., Malmirchegini, G.R., Moon, T.S., Petzold,
C.J., Ullal, A.V., Prather, K.L., Keasling, J.D. (2009) Synthetic
protein scaffolds provide modular control over metabolic flux. Nat.
Biotechnol., 27 (8), 753-759.

4 Ro, D.K., Paradise, E.M., Ouellet, M., Fisher, K.J., Newman, K.L.,
Ndungu, J.M., Ho, K.A., Eachus, R.A., Ham, T.S., Kirby, J., Chang,
M.C., Withers, S.T., Shiba, Y., Sarpong, R., Keasling, J.D. (2006)
Production of the antimalarial drug precursor artemisinic acid in
engineered yeast. Nature, 440 (7086), 940-943.

5 Parsons, J.B., Frank, S., Bhella, D., Liang, M., Prentice, M.B.,
Mulvihill, D.P., Warren, M.J. (2010) Synthesis of empty bacterial
microcompartments, directed organelle protein incorporation, and
evidence of filament-associated organelle movement. Mol. Cell, 38
(2), 305-315.

6 Biebl, H., Menzel, K., Zeng, A.P., Deckwer, W.D. (1999) Microbial
production of 1,3-propanediol. Appl. Microbiol. Biotechnol., 52 (3),
289-297.

7 Gibson, D.G., Glass, J.I., Lartigue, C., Noskov, V.N., Chuang, R.Y.,
Algire, M.A., Benders, G.A., Montague, M.G., Ma, L., Moodie, M.M.,
Merryman, C., Vashee, S., Krishnakumar, R., Assad-Garcia, N.,
Andrews-Pfannkoch, C., Denisova, E.A., Young, L., Qi, Z.Q.,
Segall-Shapiro, T.H., Calvey, C.H., Parmar, P.P., Hutchison, C.A.,
III, Smith, H.O., Venter, J.C. (2010) Creation of a bacterial cell
controlled by a chemically synthesized genome. Science, 329, 52-56.

8 Nakamura, C.E., Whited, G.M. (2003) Metabolic engineering for the
microbial production of 1,3-propanediol. Curr. Opin. Biotechnol., 14
(5), 454-459.

9 Zeng, A.P., Biebl, H. (2002) Bulk chemicals from biotechnology: the
case of 1,3-propanediol production and the new trends. Adv. Biochem.
Eng. Biotechnol., 74, 239-259.

10 Balagadde, F.K., Song, H., Ozaki, J., Collins, C.H., Barnet, M.,
Arnold, F.H., Quake, S.R., You, L. (2008) A synthetic Escherichia
coli predator-prey ecosystem. Mol. Syst. Biol., 4, 187.

11 Brenner, K., Karig, D.K., Weiss, R., Arnold, F.H. (2007) Engineered
bidirectional communication mediates a consensus in a microbial
biofilm consortium. Proc. Natl Acad. Sci. USA, 104 (44), 17300-17304.

12 Hu, B., Du, J., Zou, R.Y., Yuan, Y.J. (2010) An environment-sensitive
synthetic microbial ecosystem. PLoS One, 5 (5), e10619.

13 Weber, W., Daoud-El Baba, M., Fussenegger, M. (2007) Synthetic
ecosystems based on airborne inter- and intrakingdom communication.
Proc. Natl Acad. Sci. USA, 104 (25), 10435-10440.

14 Matsuoka, Y., Vigouroux, Y., Goodman, M.M., Sanchez, G.J., Buckler,
E., Doebley, J. (2002) A single domestication for maize shown by
multilocus microsatellite genotyping. Proc. Natl Acad. Sci. USA, 99
(9), 6080-6084.

15 Darwin, C. (1859) On the Origin of Species by Means of Natural
Selection, or the Preservation of Favoured Races in the Struggle for
Life, D. Appleton, New York.

16 Fisher, R. (1936) Has Mendel’s work been rediscovered? Ann. Sci., 1
(2), 115-137.

17 Mendel, G. (1866) Versuche über Pflanzen-Hybriden, Verhandlungen
des naturforschenden Verein Brünn. Abh. IV.

18 Avery, O.T., Macleod, C.M., McCarty, M. (1944) Studies on the
chemical nature of the substance inducing transformation of
pneumococcal types: induction of transformation by a
deoxyribonucleic acid fraction isolated from Pneumococcus type III.
J. Exp. Med., 79 (2), 137-158.

19 Weiss, B., Richardson, C.C. (1967) Enzymatic breakage and joining of
deoxyribonucleic acid, I. Repair of single-strand breaks in DNA by an
enzyme system from Escherichia coli infected with T4 bacteriophage.
Proc. Natl Acad. Sci. USA, 57 (4), 1021-1028.

20 Smith, H.O., Wilcox, K.W. (1970) A restriction enzyme from
Hemophilus influenzae. I. Purification and general properties. J.
Mol. Biol., 51 (2), 379-391.

21 Danna, K., Nathans, D. (1971) Specific cleavage of simian virus 40
DNA by restriction endonuclease of Hemophilus influenzae. Proc. Natl
Acad. Sci. USA, 68 (12), 2913-2917.

22 Jackson, D.A., Symons, R.H., Berg, P. (1972) Biochemical method for
inserting new genetic information into DNA of Simian Virus 40:
circular SV40 DNA molecules containing lambda phage genes and the
galactose operon of Escherichia coli. Proc. Natl Acad. Sci. USA, 69
(10), 2904-2909.

23 Cohen, S.N., Chang, A.C., Boyer, H.W., Helling, R.B. (1973)
Construction of biologically functional bacterial plasmids in vitro.
Proc. Natl Acad. Sci. USA, 70 (11), 3240-3244.

24 Lobban, P.E., Kaiser, A.D. (1973) Enzymatic end-to-end joining of
DNA molecules. J. Mol. Biol., 78 (3), 453-471.

25 Saiki, R.K., Gelfand, D.H., Stoffel, S., Scharf, S.J., Higuchi, R.,
Horn, G.T., Mullis, K.B., Erlich, H.A. (1988) Primer-directed
enzymatic amplification of DNA with a thermostable DNA polymerase.
Science, 239 (4839), 487-491.

26 Mullis, K.B., Faloona, F.A. (1987) Specific synthesis of DNA in
vitro via a polymerase-catalyzed chain reaction. Methods Enzymol.,
155, 335-350.

27 Saiki, R.K., Scharf, S., Faloona, F., Mullis, K.B., Horn, G.T.,
Erlich, H.A., Arnheim, N. (1985) Enzymatic amplification of
beta-globin genomic sequences and restriction site analysis for
diagnosis of sickle cell anemia. Science, 230 (4732), 1350-1354.

28 Engler, C., Gruetzner, R., Kandzia, R., Marillonnet, S. (2009)
Golden gate shuffling: a one-pot DNA shuffling method based on type
IIs restriction enzymes. PLoS One, 4 (5), e5553.

29 Engler, C., Kandzia, R., Marillonnet, S. (2008) A one pot, one step,
precision cloning method with high throughput capability. PLoS One,
3 (11), e3647.

30 Shetty, R.P., Endy, D., Knight, T.F. Jr (2008) Engineering BioBrick
vectors from BioBrick parts. J. Biol. Eng., 2, 5.

31 Sleight, S.C., Bartley, B.A., Lieviant, J.A., Sauro, H.M. (2010)
In-fusion biobrick assembly and re-engineering. Nucleic Acids Res.,
38 (8), 2624-2636.

32 Khorana, H.G., Agarwal, K.L., Buchi, H., Caruthers, M.H., Gupta,
N.K., Kleppe, K., Kumar, A., Otsuka, E., RajBhandary, U.L., Van de
Sande, J.H., Sgaramella, V., Terao, T., Weber, H., Yamada, T. (1972)
Studies on polynucleotides. 103. Total synthesis of the structural
gene for an alanine transfer ribonucleic acid from yeast. J. Mol.
Biol., 72 (2), 209-217.

33 Edge, M.D., Green, A.R., Heathcliffe, G.R., Meacock, P.A., Schuch,
W., Scanlon, D.B., Atkinson, T.C., Newton, C.R., Markham, A.F. (1981)
Total synthesis of a human leukocyte interferon gene. Nature, 292
(5825), 756-762.

34 Itakura, K., Hirose, T., Crea, R., Riggs, A.D., Heyneker, H.L.,
Bolivar, F., Boyer, H.W. (1977) Expression in Escherichia coli of a
chemically synthesized gene for the hormone somatostatin. Science,
198 (4321), 1056-1063.

35 Stemmer, W.P., Crameri, A., Ha, K.D., Brennan, T.M., Heyneker, H.L.
(1995) Single-step assembly of a gene and entire plasmid from large
numbers of oligodeoxyribonucleotides. Gene, 164 (1), 49-53.

36 Bang, D., Church, G.M. (2008) Gene synthesis by circular assembly
amplification. Nat. Methods, 5 (1), 37-39.

37 Gibson, D.G., Young, L., Chuang, R.Y., Venter, J.C., Hutchison,
C.A., III, Smith, H.O. (2009) Enzymatic assembly of DNA molecules up
to several hundred kilobases. Nat. Methods, 6 (5), 343-345.

38 Li, M.Z., Elledge, S.J. (2007) Harnessing homologous recombination
in vitro to generate recombinant DNA via SLIC. Nat. Methods, 4 (3),
251-256.

39 Smith, H.O., Hutchison, C.A., III, Pfannkoch, C., Venter, J.C.
(2003) Generating a synthetic genome by whole genome assembly:
phiX174 bacteriophage from synthetic oligonucleotides. Proc. Natl
Acad. Sci. USA, 100 (26), 15440-15445.

40 Gibson, D.G., Benders, G.A., Andrews-Pfannkoch, C., Denisova, E.A.,
Baden-Tillson, H., Zaveri, J., Stockwell, T.B., Brownley, A., Thomas,
D.W., Algire, M.A., Merryman, C., Young, L., Noskov, V.N., Glass,
J.I., Venter, J.C., Hutchison, C.A., III, Smith, H.O. (2008) Complete
chemical synthesis, assembly, and cloning of a Mycoplasma genitalium
genome. Science, 319 (5867), 1215-1220.

41 Gibson, D.G., Benders, G.A., Axelrod, K.C., Zaveri, J., Algire,
M.A., Moodie, M., Montague, M.G., Venter, J.C., Smith, H.O.,
Hutchison, C.A., III (2008) One-step assembly in yeast of 25
overlapping DNA fragments to form a complete synthetic Mycoplasma
genitalium genome. Proc. Natl Acad. Sci. USA, 105 (51), 20404-20409.

42 Tsuge, K., Matsui, K., Itaya, M. (2003) One step assembly of
multiple DNA fragments with a designed order and orientation in
Bacillus subtilis plasmid. Nucleic Acids Res., 31 (21), e133.

43 Gibson, D.G. (2009) Synthesis of DNA fragments in yeast by one-step
assembly of overlapping oligonucleotides. Nucleic Acids Res., 37
(20), 6984-6990.

44 Shao, Z., Zhao, H. (2009) DNA assembler, an in vivo genetic method
for rapid construction of biochemical pathways. Nucleic Acids Res.,
37 (2), e16.

45 Quan, J., Tian, J. (2009) Circular polymerase extension cloning of
complex gene libraries and pathways. PLoS One, 4 (7), e6441.

46 Zhu, B., Cai, G., Hall, E.O., Freeman, G.J. (2007) In-fusion
assembly: seamless engineering of multidomain fusion proteins,
modular vectors, and mutations. Biotechniques, 43 (3), 354-359.

47 Ma, H., Kunes, S., Schatz, P.J., Botstein, D. (1987) Plasmid
construction by homologous recombination in yeast. Gene, 58 (2-3),
201-216.

48 Moerschell, R.P., Tsunasawa, S., Sherman, F. (1988) Transformation
of yeast with synthetic oligonucleotides. Proc. Natl Acad. Sci. USA,
85 (2), 524-528.

49 Orr-Weaver, T.L., Szostak, J.W., Rothstein, R.J. (1981) Yeast
transformation: a model system for the study of recombination. Proc.
Natl Acad. Sci. USA, 78 (10), 6354-6358.

50 Raymond, C.K., Pownder, T.A., Sexson, S.L. (1999) General method for
plasmid construction using homologous recombination. Biotechniques,
26 (1), 134-138, 140-141.

51 Raymond, C.K., Sims, E.H., Olson, M.V. (2002) Linker-mediated
recombinational subcloning of large DNA fragments using yeast.
Genome Res., 12 (1), 190-197.

52 Gibson, D.G., Smith, H.O., Hutchison, C.A., III, Venter, J.C.,
Merryman, C. (2010) Chemical synthesis of the mouse mitochondrial
genome. Nat. Methods, 7, 901-903.

53 Itaya, M., Fujita, K., Kuroki, A., Tsuge, K. (2008) Bottom-up genome
assembly using the Bacillus subtilis genome vector. Nat. Methods, 5
(1), 41-43.

54 Itaya, M., Tsuge, K., Koizumi, M., Fujita, K. (2005) Combining two
genomes in one cell: stable cloning of the Synechocystis PCC6803
genome in the Bacillus subtilis 168 genome. Proc. Natl Acad. Sci.
USA, 102 (44), 15971-15976.

55 Richmond, K.E., Li, M.H., Rodesch, M.J., Patel, M., Lowe, A.M., Kim,
C., Chu, L.L., Venkataramaian, N., Flickinger, S.F., Kaysen, J.,
Belshaw, P.J., Sussman, M.R., Cerrina, F. (2004) Amplification and
assembly of chip-eluted DNA (AACED): a method for high-throughput
gene synthesis. Nucleic Acids Res., 32 (17), 5011-5018.

56 Tian, J., Gong, H., Sheng, N., Zhou, X., Gulari, E., Gao, X.,
Church, G. (2004) Accurate multiplex gene synthesis from
programmable DNA microchips. Nature, 432 (7020), 1050-1054.

57 Zhou, X., Cai, S., Hong, A., You, Q., Yu, P., Sheng, N.,
Srivannavit, O., Muranjan, S., Rouillard, J.M., Xia, Y., Zhang, X.,
Xiang, Q., Ganesh, R., Zhu, Q., Matejko, A., Gulari, E., Gao, X.
(2004) Microfluidic PicoArray synthesis of oligodeoxynucleotides and
simultaneous assembling of multiple DNA sequences. Nucleic Acids
Res., 32 (18), 5409-5417.

58 Carr, P.A., Park, J.S., Lee, Y.J., Yu, T., Zhang, S., Jacobson, J.M.
(2004) Protein-mediated error correction for de novo DNA synthesis.
Nucleic Acids Res., 32 (20), e162.

59 Binkowski, B.F., Richmond, K.E., Kaysen, J., Sussman, M.R., Belshaw,
P.J. (2005) Correcting errors in synthetic DNA through consensus
shuffling. Nucleic Acids Res., 33 (6), e55.

60 Sawitzke, J., Thomason, L., Costantino, N., Bubunenko, M., Datta, S.
(2007) Recombineering: in vivo genetic engineering in E. coli, S.
enterica, and beyond. Methods Enzymol., 421, 171-199.


61 Herring, C.D., Glasner, J.D., Blattner, F.R. (2003) Gene replacement
without selection: regulated suppression of amber mutations in
Escherichia coli. Gene, 311, 153-163.

62 Datsenko, K.A., Wanner, B.L. (2000) One-step inactivation of
chromosomal genes in Escherichia coli K-12 using PCR products. Proc.
Natl Acad. Sci. USA, 97 (12), 6640-6645.

63 Lesic, B., Rahme, L.G. (2008) Use of the lambda Red recombinase
system to rapidly generate mutants in Pseudomonas aeruginosa. BMC
Mol. Biol., 9, 20.

64 Karlinsey, J.E. (2007) Lambda-Red genetic engineering in Salmonella
enterica serovar Typhimurium. Methods Enzymol., 421, 199-209.

65 van Kessel, J.C., Hatfull, G.F. (2008) Mycobacterial
recombineering. Methods Mol. Biol., 435, 203-215.

66 Wang, H.H., Isaacs, F.J., Carr, P.A., Sun, Z.Z., Xu, G., Forest,
C.R., Church, G.M. (2009) Programming cells by multiplex genome
engineering and accelerated evolution. Nature, 460 (7257), 894-898.

67 Warner, J.R., Reeder, P.J., Karimpour-Fard, A., Woodruff, L.B.,
Gill, R.T. (2010) Rapid profiling of a microbial genome using
mixtures of barcoded oligonucleotides. Nat. Biotechnol., 28 (8),
856-862.

68 Suzuki, Y., St Onge, R.P., Mani, R., King, O.D., Heilbut, A.,
Labunskyy, V.M., Chen, W., Pham, L., Zhang, L.V., Tong, A.H., Nislow,
C., Giaever, G., Gladyshev, V.N., Vidal, M., Schow, P., Lehar, J.,
Roth, F.P. (2010) Knocking out multi-gene redundancies via cycles of
sexual assortment and fluorescence selection. Nat. Methods, 8,
159-164.

69 Tyo, K.E., Ajikumar, P.K., Stephanopoulos, G. (2009) Stabilized gene
duplication enables long-term selection-free heterologous pathway
expression. Nat. Biotechnol., 27 (8), 760-765.

70 Khalil, A.S., Collins, J.J. (2010) Synthetic biology: applications
come of age. Nat. Rev. Genet., 11 (5), 367-379.

71 Mukherji, S., van Oudenaarden, A. (2009) Synthetic biology:
understanding biological design from synthetic circuits. Nat. Rev.
Genet., 10 (12), 859-871.

72 McAdams, H.H., Arkin, A. (2000) Towards a circuit engineering
discipline. Curr. Biol., 10 (8), R318-R320.


73 Elowitz, M.B., Leibler, S. (2000) A synthetic oscillatory network of
transcriptional regulators. Nature, 403 (6767), 335-338.

74 Gardner, T.S., Cantor, C.R., Collins, J.J. (2000) Construction of a
genetic toggle switch in Escherichia coli. Nature, 403 (6767),
339-342.

75 Dueber, J.E., Yeh, B.J., Chak, K., Lim, W.A. (2003) Reprogramming
control of an allosteric signaling switch through modular
recombination. Science, 301 (5641), 1904-1908.

76 Kramer, B.P., Fussenegger, M. (2005) Hysteresis in a synthetic
mammalian gene network. Proc. Natl Acad. Sci. USA, 102 (27),
9517-9522.

77 Danino, T., Mondragon-Palomino, O., Tsimring, L., Hasty, J. (2010) A
synchronized quorum of genetic clocks. Nature, 463 (7279), 326-330.

78 Fung, E., Wong, W.W., Suen, J.K., Bulter, T., Lee, S.G., Liao, J.C.
(2005) A synthetic gene-metabolic oscillator. Nature, 435 (7038),
118-122.

79 Stricker, J., Cookson, S., Bennett, M.R., Mather, W.H., Tsimring,
L.S., Hasty, J. (2008) A fast, robust and tunable synthetic gene
oscillator. Nature, 456 (7221), 516-519.

80 Tigges, M., Marquez-Lago, T.T., Stelling, J., Fussenegger, M. (2009)
A tunable synthetic mammalian oscillator. Nature, 457 (7227),
309-312.

81 Ajo-Franklin, C.M., Drubin, D.A., Eskin, J.A., Gee, E.P., Landgraf,
D., Phillips, I., Silver, P.A. (2007) Rational design of memory in
eukaryotic cells. Genes Dev., 21 (18), 2271-2276.

82 Friedland, A.E., Lu, T.K., Wang, X., Shi, D., Church, G., Collins,
J.J. (2009) Synthetic gene networks that count. Science, 324 (5931),
1199-1202.

83 Basu, S., Mehreja, R., Thiberge, S., Chen, M.T., Weiss, R. (2004)
Spatiotemporal control of gene expression with pulse-generating
networks. Proc. Natl Acad. Sci. USA, 101 (17), 6355-6360.

84 Anderson, J.C., Voigt, C.A., Arkin, A.P. (2007) Environmental signal
integration by a modular AND gate. Mol. Syst. Biol., 3, 133.

85 Basu, S., Gerchman, Y., Collins, C.H., Arnold, F.H., Weiss, R.
(2005) A synthetic multicellular system for programmed pattern
formation. Nature, 434 (7037), 1130-1134.

Bulter, T., Lee, S.G., Wong, W.W., Fung, 
E., Connor, M.R., Liao, J.C. (2004) Design 
of artificial cell-cell communication using 
gene and metabolic networks. Proc. Natl 
Acad. Sci. USA, 101 (8), 2299-2304. 

You, L., Cox, R.S., III, Weiss, R., Arnold, 


F.H. (2004) Programmed population 
control by cell-cell communication and 
regulated killing. Nature, 428 (6985), 
868-871. 

Aubel, D., Fussenegger, M. (2010) 
Mammalian synthetic —_ biology—from 
tools to therapies. BioEssays, 32 (4), 
332-345. 

Grunberg, R., Serrano, L. (2010) Strategies 


for protein synthetic biology. Nucleic Acids 
Res., 38 (8), 2663-2675. 

Haynes, K.A., Silver, P.A. (2009) Eukaryotic 
systems broaden the scope of synthetic 
biology. J. Cell Biol., 187 (5), 589-596. 
Choudhary, S., Schmidt-Dannert, C. (2010) 
Applications of quorum sensing in biotech- 
nology. Appl. Microbiol. Biotechnol., 86 (5), 
1267-1279. 

Richardson, S.M., Nunley,  P.W., 
Yarrington, R.M., Boeke, J.D., Bader, J.S. 
(2010) GeneDesign 3.0 is an updated 
synthetic biology toolkit. Nucleic Acids Res., 
38 (8), 2603-2606. 

Endy, D. (2005) Foundations for engineer- 
ing biology. Nature, 438 (7067), 449-453. 
Weeding, E., Houle, J., Kaznessis, Y.N. 
(2010) SynBioSS designer: a web-based tool 
for the automated generation of kinetic 
models for synthetic biological constructs. 
Brief: Bioinform., 11 (4), 394-402. 
Kobayashi, H., Kaern, M., Araki, M., 
Chung, K., Gardner, T.S., Cantor, C.R., 
Collins, J.J. (2004) Programmable cells: 
interfacing natural and engineered gene 
networks. Proc. Natl Acad. Sci. USA, 101 
(22), 8414-8419. 

Anderson, J.C., Clarke, E.J., Arkin, A.P., 
Voigt, C.A. (2006) Environmentally con- 
trolled invasion of cancer cells by en- 
gineered bacteria. J. Mol. Biol., 355 (4), 
619-627. 

Looger, L.L., Dwyer, M.A., Smith, J.J., 
Hellinga, H.W. (2003) Computational de- 
sign of receptor and sensor proteins 
with novel functions. Nature, 423 (6936), 
185-190. 


98 


99 


100 


101 


102 


103 


104 


105 


106 


107 


Lu, T.K., Collins, J.J. (2009) Engineered 
bacteriophage targeting gene networks as 
adjuvants for antibiotic therapy. Proc. Natl 
Acad. Sci. USA, 106 (12), 4629-4634. 
Ajikumar, P.K., Xiao, W.-H., Tyo, K.EJ., 
Wang, Y., Simeon, F., Leonard, E., Mucha, 
O., Phon, T.H., Pfeifer, B., Stephanopoulos, 
G. (2010) Isoprenoid pathway optimiza- 
tion for taxol precursor overproduction 
in Escherichia coli. Science, 330 (6000), 
70-74. 

Isalan, M., Lemerle, C., Serrano, L. (2005) 
Engineering gene networks to emulate 
Drosophila embryonic pattern formation. 
PLoS Biol., 3 (3), e64. 

Tabor, J.J., Salis, H.M., Simpson, Z.B., 
Chevalier, A.A., Levskaya, A., Marcotte, 
E.M., Voigt, C.A., Ellington, A.D. (2009) A 
synthetic genetic edge detection program. 
Cell, 137 (7), 1272-1281. 

Hyde, C.C., Ahmed, S.A.,  Padlan, 
E.A., Miles, E.W., Davies, D.R. (1988) 
Three-dimensional structure of the 
tryptophan synthase alpha 2 beta 2 
multienzyme complex from Salmonella 
typhimurium. J. Biol. Chem., 263 (33), 
17857-17871. 

Jorgensen, K., Rasmussen, A.V., Morant, 
M., Nielsen, A.H., Bjarnholt, N., 
Zagrobelny, M., Bak, S., Moller, B.L. 
(2005) Metabolon formation and metabolic 
channeling in the biosynthesis of plant 
natural products. Curr. Opin. Plant Biol., 8 
(3), 280-291. 

Pfeifer, B.A., Khosla, C. (2001) Biosynthe- 
sis of polyketides in heterologous hosts. 
Microbiol. Mol. Biol. Rev., 65 (1), 106-118. 
Thoden, J.B., Holden, H.M., Wesenberg, 
G., Raushel, F.M., Rayment, I. (1997) Struc- 
ture of carbamoyl phosphate synthetase: a 
journey of 96 A from substrate to product. 
Biochemistry, 36 (21), 6305-6316. 

Bobik, T.A., Havemann, G.D., Busch, 
RJ., Williams, D.S., Aldrich, H.C. 
(1999) The propanediol utilization (pdu) 
operon of Salmonella enterica serovar. 
typhimurium LT2 includes genes necessary 
for formation of polyhedral organelles 
involved in coenzyme B(12)-dependent 1, 
2-propanediol degradation. J. Bacteriol., 
181 (19), 5967-5975. 

Kofoid, E., Rappleye, C., Stojiljkovic, L., 
Roth, J. (1999) The 17-gene ethanolamine 
(eut) operon of Salmonella typhimurium 


108 


109 


110 


111 


112 


113 


114 


115 


116 


encodes five homologues of carboxysome 


shell proteins. J. Bacteriol, 181 (17), 
5317-5329. 

Conrado, R.J., Mansell, T.J., Varner, 
J.D., DeLisa, M.P. (2007) Stochastic 


reaction-diffusion simulation of enzyme 
compartmentalization reveals improved 
catalytic efficiency for a synthetic metabolic 
pathway. Metab. Eng., 9 (4), 355-363. 
Pettersson, H., Pettersson, G. (2001) Ki- 
netics of the coupled reaction catalysed by 
a fusion protein of beta-galactosidase and 
galactose dehydrogenase. Biochim. Biophys. 
Acta, 1549 (2), 155-160. 

Wenz, G. (1994) Cyclodextrins as building 
blocks for supramolecular structures and 
functional units. Angew. Chem. Int. Ed. 
Engl., 33 (8), 803-822. 

Vriezema, D.M., Comellas Aragones, M., 
Elemans, J.A., Cornelissen, J.J., Rowan, 
A.E., Nolte, RJ. (2005) Self-assembled 
nanoreactors. Chem. Rev. 105 (4), 
1445-1489. 

Martin, V.J., Pitera, D.J., Withers, S.T., 
Newman, J.D., Keasling, J.D. (2003) En- 
gineering a mevalonate pathway in Es- 
cherichia coli for production of terpenoids. 
Nat. Biotechnol., 21 (7), 796-802. 
Mingardon, F., Chanal, A.,  Lopez- 
Contreras, A.M., Dray, C., Bayer, E.A., 
Fierobe, H.P. (2007) Incorporation of 
fungal cellulases in bacterial minicellulo- 
somes yields viable, synergistically acting 
cellulolytic complexes. Appl. Environ. 
Microbiol., 73 (12), 3822-3832. 
Mingardon, F., Chanal, A., Tardif, 
C., Bayer, E.A., Fierobe, H.P. (2007) 
Exploration of new geometries in 
cellulosome-like chimeras. Appl. Environ. 
Microbiol., 73 (22), 7138-7149. 

Iancu, C.V., Ding, H.J., Morris, D.M., Dias, 
D.P., Gonzales, A.D., Martino, A., Jensen, 
G.J. (2007) The structure of isolated Syne- 
chococcus strain WH8102 carboxysomes as 
revealed by electron cryotomography. J. 
Mol. Biol., 372. (3), 764-773. 
Comellas-Aragones, M.,  Engelkamp, 
H., Claessen, V.I., Sommerdijk, N.A., 
Rowan, A.E., Christianen, P.C., Maan, 
J.C., Verduin, B.J., Cornelissen, J.J., Nolte, 
R.J. (2007) A virus-based single-enzyme 
nanoreactor. Nat. Nanotechnol., 2 (10), 
635-639. 


117 


118 


119 


120 


121 


122 


123 


124 


125 


Synthetic Biology: Implications and Uses 


Yang, C., Freudl, R., Qiao, C. (2009) Ex- 
port of methyl parathion hydrolase to the 
periplasm by the twin-arginine transloca- 
tion pathway in Escherichia coli. J. Agric. 
Food Chem., 57 (19), 8901-8905. 

Benders, G.A., Noskov, V.N., Denisova, 
E.A.,  Lartigue, C., Gibson, D.G., 
Assad-Garcia, N., Chuang, R.Y., Carrera, 
W., Moodie, M., Algire, M.A., Phan, Q., 
Alperovich, N., Vashee, S., Merryman, 
C., Venter, J.C., Smith, H.O., Glass, J.I., 
Hutchison, C.A., III (2010) Cloning whole 
bacterial genomes in yeast. Nucleic Acids 
Res., 38 (8), 2558-2569. 

Lartigue, C., Glass, J.I., Alperovich, N., 
Pieper, R., Parmar, P.P., Hutchison, C.A., 
Il], Smith, H.O., Venter, J.C. (2007) 
Genome transplantation in bacteria: chang- 
ing one species to another. Science, 317 
(5838), 632-638. 

Lartigue, C., Vashee, S., Algire, M.A., 
Chuang, R.Y., Benders, G.A., Ma, L., 
Noskov, V.N., Denisova, E.A., Gibson, 
D.G., Assad-Garcia, N., Alperovich, N., 
Thomas, D.W., Merryman, C., Hutchison, 
C.A., III, Smith, H.O., Venter, J.C., Glass, 
J.I. (2009) Creating bacterial strains from 
genomes that have been cloned and en- 
gineered in yeast. Science, 325 (5948), 
1693-1696. 


Noskoy, V.N.,  Segall-Shapiro, T.H., 
Chuang, R.Y. (2010) Tandem repeat 
coupled with endonuclease cleavage 


(TREC): a seamless modification tool for 
genome engineering in yeast. Nucleic Acids 
Res., 38 (8), 2570-2576. 

Cho, M.K., Magnus, D., Caplan, A.L., 
McGee, D. (1999) Policy forum: genetics. 
Ethical considerations in synthesizing a 
minimal genome. Science, 286 (5447), 2087, 
2089-2090. 

Cho, M.K., Relman, D.A. (2010) Genetic 
technologies. Synthetic ‘life,’ ethics, na- 
tional security, and public discourse. Sci- 
ence, 329 (5987), 38-39. 

De Vriend, H.C. (2006) Constructing Life — 
Early Social Reflections on the Emerging Field 
of Synthetic Biology, Working Document 97, 
Rathenau Institute, The Hague. 

Garfinkel, M.S., Endy, D., Epstein, G.L., 
Friedman, R.M. (2007) Synthetic genomics: 
Options for governance. Biosecur. Bioterror., 
5 (4), 359-362. 


31 


32 


Synthetic Biology: Implications and Uses 


126 Anonymous (2010) Screening Framework 
Guidance for Providers of Synthetic Double- 
stranded DNA, Department of Health and 
Human Services. Available at: http://www. 
phe.gov/Preparedness/legal/guidance/ 
syndna/Documents/syndna-guidance.pdf 
(accessed 7 April 2011). 

127 International Gene Synthesis Consortium 
(2009) Harmonized Screening Protocol: 
Gene Sequence & Customer Screening to 
Promote Biosecurity. Available at: http:// 
www.genesynthesisconsortium.org/Gene_ 
Synthesis_Consortium/Harmonized_ 
Screening_Protocol.html (accessed 7 April 
2011). 

128 Committee on Research Standards and 
Practices to Prevent the Destructive Ap- 
plication of Biotechnology (2004) in: Na- 
tional Research Council of the National 
Academies (Ed.) Biotechnology Research 
in An Age of Bioterrorism, The National 
Academies Press, Washington, DC. 

129 DIYbio (2010) An Institution for the Am- 
ateur Biologist. Available at: http://diybio. 
org/blog (accessed 7 April 2011). 


130 The Presidential Commission for the 
Study of Bioethical Issues (2010) Agenda 
for September 13-14. Available at: http:// 
www.bioethics.gov/meetings/091310/ 
(accessed 7 April 2011). 

Corrigan-Curay, J. (2009) NIH Guidelines 
for Research Involving Recombinant DNA: 
Biosafety and Synthetic Nucleic Acids, De- 
partment of Health and Human Services. 
Available at: http://oba.od.nih.gov/ 
biosecurity/meetings/200912/ 
CorriganCuray_NIH%20Guidelines% 
20Update.pdf (accessed 7 April 2011). 

132 Anonymous (2007) Presentations on 
Safety, Security, and Societal Issues, in Syn- 
thetic Biology 3.0. Available at: http://www. 
syntheticbiology3.ethz.ch/monday.htm 
(accessed 7 April 2011). 

Kaebnick, G., Murray, T.H., Parens, 
E. (2009) Ethical Issues in Synthetic 
Biology. Available at: —_ http://www. 
thehastingscenter.org/Research/Detail. 
aspx?id=1548 (accessed 7 April 2011). 


13 


_ 


13 


w 


Part I
Biological Basis 


Synthetic Biology: Advances in Molecular Biology and Medicine 
First Edition. Edited by Robert A. Meyers. 
© 2015 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2015 by Wiley-VCH Verlag GmbH & Co. KGaA. 




2 
The Emergence of the First Cells


Antoine Danchin 
Research and Development, AMAbiotics SAS, Building G1, 2 rue Gaston 
Crémieux, 91000 Evry, France 


Introduction

Origin of Prebiotic Metabolism and Compartments
Comparative Genomics as a Way to Propose a Scenario for the Origins
Surface Chemistry
The Origin of Nucleotides and the RNA-Metabolism World
From Substrates to Templates: the RNA-Genome World

Origin of Extant Cells
Origin of the Archaea
Origin of the Bacteria
From Protokarya to Eukarya
Between Domains: the Perpetuation of Horizontal Gene Transfer

Conclusion

References


Keywords 


Minerals

Chemical compounds formed as a result of geological processes and shaped in a variety
of crystalline or semi-crystalline forms (e.g., clays, fool’s gold, crystals, etc.).




RNA world 

With the discovery of catalytic RNAs, Walter Gilbert used the expression “RNA world” in
1986 to describe this situation. This led some investigators to propose that life originated
in a world made of RNA molecules (with the idea that RNA predated proteins). Because
the RNA world implies the existence of both metabolic reactions and replication, which
are quite distinct processes, it must be split into an “RNA-metabolism world” (where
RNA is a substrate) and an “RNA-genome world” (where RNA is a template).


Protokarya 

The family of phagocytic organisms that predated the split of living forms into Eukarya,
Bacteria, and Archaea. These organisms combined, within a single enclosing membrane,
two compartments derived from the RNA-metabolism world (the cytoplasm) and the
RNA-genome world (the nucleus).


Horizontal gene transfer (HGT) 

The reproduction of living organisms implies that genes are transmitted from parents 
to their progeny, in a vertical fashion. However, another process of gene transmission 
has been discovered where genes are transmitted between individuals of the same 
generation, possibly from different species (via infection by viruses, conjugation or even 
direct invasion by nucleic acids, e.g., transfection, transformation). This transmission is 
therefore orthogonal to standard vertical gene transmission, and has been named HGT 
accordingly. 


The scenario proposed here builds on the critical need for compartmentalization at 
the origin of life. In a first step, the surface of minerals was compartmentalized and 
selected the reactive compounds that formed primitive metabolism. Subsequently, 
RNA molecules replaced mineral surfaces after the discovery of nitrogen fixation 
and the emergence of ribonucleotides, in parallel with a machinery for the synthesis 
of peptides, coenzymes, and lipids. The RNA-metabolism world then developed 
into an RNA-genome world based on RNA as informational templates rather than 
substrates. Bordered by lipids, the first cells were phagocytes, Protokarya, which 
put together two compartments stemming from the RNA-metabolism world (the 
cytoplasm) and the RNA-genome world (the nucleus). The emergence of stable 
deoxyribonucleotides allowed the clustering together of genes into chromosomes, 
while phagocytosis created the opportunity for an escape based on an alternative 
metabolism of membrane lipids and conquest of extreme environments, with the 
Archaea, and on the emergence of a robust and phagocyte-resistant envelope, with 
the Bacteria. Reductive evolution allowed bacteria with a modified envelope to be
phagocytosed again as symbionts of Protokarya, leading to the final generation of
the Eukarya. The horizontal transfer of genetic material, which initially resulted
from phagocytosis, continued with the emergence of gene transfer via specialized
conjugation machineries and viruses.


1 
Introduction 


Speculations about the origin of life are
educated guesses rather than authentic
scientific hypotheses. Based on ideas rooted
in a particular culture and personal history,
this domain pertains to the Greek δόξα
(opinion), rather than to the domain of
ἀλήθεια (truth). Nevertheless, it is quite
possible to try and take a Copernican view
on the topic, attempting to escape
anthropocentrism [1]. In what follows, the
limitations imposed by physics and chemistry
are merged to form an informational view
of life, summarizing how information
progressively crept into physico-chemical
assemblies to give rise to what is now known
as extant life. Importantly, there is no call
for a single origin for life, nor is a common
ancestor to all living cells expected,
as life is seen as an ongoing process of the
creation of novel information [2].

In a nutshell, it can be proposed that life 
requires: 


• a machine (the “chassis” of synthetic
biology [3]), defining an inside and an
outside, which reproduces and allows
the expression of

• a program (a “cookbook”), which
replicates, and

• a set of coupling processes, metabolism,
which manages the flow of matter and
energy and regulates information
recursively from the program’s expression.


In concrete terms, the split between the
program and the machine has been
established over many years of research,
culminating in a proof of concept
when a genome was successfully
transplanted from one species of Mollicutes
to another and shown to reprogram the host
cell [4]. This demonstration, however,
opened up a specific contrast between the
reproduction of the host cell and the
replication of the genetic program.
Remarkably, in a short book (somewhat
overlooked despite its importance),
Freeman Dyson had convincingly
established that information propagates
along two lines: metabolism reproduces
(makes a similar copy) while the program
replicates (makes an identical copy).
Dyson further showed that in any realistic
scenario of the origin of life, reproduction
must predate replication, and that
a reproducing metabolism must have
existed before the onset of the first living
cells, and before the emergence of
replicating templates [5]. As a consequence,
a succession of at least two chemically
independent origins was needed to
explain how life came into being. The first
origin involved the development and
reproduction of the ancestor of intermediary
metabolism, until metabolic products
(presumably polymers) discovered the way
to replicate. Dyson did not propose a
scenario for the actual building up of
metabolites, nor did he discuss the
emergence of nucleic acids or the
asymmetry necessary to input information
in the process; rather, he simply pointed out
some of the constraints that are required in
any consistent scenario depicting the origin
of life.

An important choice at this point is
whether to restrict these scenarios to
the Earth or to seek external origins;
here, Occam’s razor is followed. Investigating
life outside the Earth as the source of extant
life would make the question even more
intractable than exploring the Earth’s past.
In the absence of compelling reasons
to seek outside origins (as is done in many
scenarios [6, 7]), the reflection is therefore
restricted to humankind’s limited knowledge
of the past atmosphere of planet Earth and,
of course, of extant life.

To pursue this reflection, a view of
prebiotic metabolism is offered where,
following the emergence of nucleotides, with
polyphosphates as an energy store [8],
nucleic acids begin as the substrates of
metabolic reactions, before the emergence of
nucleic acids as templates for the transfer
of information, with the translation
process at its core [9]. To this end, various
scenarios are first proposed to explain the
origin of primary components (elementary
biobricks), particularly basic amino acids,
nucleotides, coenzymes, and lipids, which
are required to allow the building up of
compartmentalized cells. Subsequently, a
scenario is proposed where populations of
protocells displaying a general propensity
to behave as phagocytes evolve into
the three domains of life as it is known
today, namely Eukarya, Bacteria, and
Archaea.


2 
Origin of Prebiotic Metabolism and 
Compartments 


To emphasize this standpoint, some
extreme reactions in the scientific debate
that animates this question can be given
here. Two quotations expressing the most
sharply opposed views illustrate the situation.
In an argument similar to that of Graham
Cairns-Smith in his Genetic Takeover and
the Mineral Origins of Life [10], Steven Benner
and coworkers, for example, wrote that
“arguments that attempt to extrapolate
from modern biochemistry back to the
origin of life are futile” [11]. In sharp contrast,
Günter Wächtershäuser described his own
approach [12] as “...a reconstruction of
precursor pathways by retrodiction from
extant pathways,” as was earlier proposed
by Sam Granick in a scenario inspired by
photosynthesis:


“I shall propose [...] that this
unit originated from some common
minerals; that the minerals
that contain metal ions served
both as coordinating templates
and catalysts for various reactions,
and that around this unit
were formed organic molecules
that gradually became organised
into units of ever-increasing
complexity. Gradually, biosynthetic
chains developed in a stepwise
fashion, using small molecules to
make molecules of ever-increasing
complexity. The metal catalysts
became modified into the
metalloenzymes; in these new complexes
the same metals would now act
as more efficient catalyst. The
experimental method whereby it is
proposed to find the evolutionary
precursors of protoplasm is to
examine present-day biochemical
reactions in protoplasm and
seek to relate them to reactions
that may have occurred and may
still occur in the minerals around
us.” [13].


That latter stance, with arguments
stemming from the study of comparative
genomics, is taken at this point.


2.1 
Comparative Genomics as a Way to 
Propose a Scenario for the Origins 


For Cairns-Smith, even if a genetic system
was generated in the early stages of life’s
development, it has been taken over by
novel systems that erased the memory of
the past. This would prevent its traces in
extant life from being uncovered [10]: life
is a palimpsest. In contrast, for Granick
[13], Wächtershäuser [12] and the present
author [14], there may be an expectation to
see in present-day living organisms traces
of what they were in the early times. It may
even be possible to identify in the present
set-up of living organisms archives that
provide reminders of the way in which the
origins unfolded [9, 15]. Here is how it
goes.

Comparative genomics is supposed to
allow investigators to seek conserved genes
in genomes (orthologs [16]) as a way
to infer conserved functions (in particular
metabolic functions). In the early
times of genome sequencing, it was
proposed as a straightforward outcome of the
genome projects that comparison between
genome sequences would identify
conserved genes, allowing the most ancient
functions to be identified [17].
Unfortunately, as more and more genomes were
sequenced, the number of conserved
orthologs kept decreasing, until none
remained [18]. However, it should have been
clear from the outset that there should
never have been any expectation for genes
to be generally conserved, as this would
entail a single origin for life’s functions
with no horizontal transfer of information.
Yet, horizontal gene transfer (HGT)
is the rule, at least in Bacteria [19]. If
something is conserved, it should rather
be expected that some functions (but
not structures) would be ubiquitous, and
because of HGT any innovation could
spread rapidly and outcompete former
structures. There is no reason for any
bijective correspondence between protein/gene
structure and biological function: several
structures can fulfill the same function
[16]. However, because functional
innovation is prone to be passed on intact in the
progeny of a living organism, there will
be a tendency to propagate a successful
protein/gene structure over generations.
This process will often result in groups of
genomes (but by no means all genomes,
which implies that there is no universal
ancestor to all life forms) keeping the same
orthologous gene, corresponding to a
particular function. This outcome has been
named gene persistence [20]. Therefore,
even in the absence of universally preserved
orthology, it remains possible to compare
genomes and to create a family of persistent
genes that will help to define a minimal set of
functions required for the development of
life [21].
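
The distinction between strict conservation and persistence can be illustrated with a minimal Python sketch. It is not the procedure used in Refs [20, 21]: the gene-family labels, the toy genomes, and the 0.75 threshold are hypothetical values chosen purely for illustration, and the assignment of genes to families is assumed to have been done beforehand.

```python
from collections import Counter


def gene_persistence(genomes):
    """Return {family: fraction of genomes in which the family is present}."""
    counts = Counter()
    for families in genomes.values():
        counts.update(set(families))  # count each family once per genome
    total = len(genomes)
    return {family: n / total for family, n in counts.items()}


def persistent_families(genomes, threshold):
    """Families present in at least `threshold` (a fraction) of the genomes."""
    persistence = gene_persistence(genomes)
    return {family for family, p in persistence.items() if p >= threshold}


# Hypothetical presence/absence data: genome name -> set of gene families.
toy_genomes = {
    "genome_A": {"fam_ribosomal", "fam_synthetase", "fam_division", "fam_rare_1"},
    "genome_B": {"fam_ribosomal", "fam_synthetase", "fam_division"},
    "genome_C": {"fam_ribosomal", "fam_division", "fam_rare_2"},
    "genome_D": {"fam_ribosomal", "fam_synthetase"},
}

# Strict conservation: the family must be present in every genome.
universal = persistent_families(toy_genomes, threshold=1.0)
# Persistence: the family must be present in most, but not all, genomes.
persistent = persistent_families(toy_genomes, threshold=0.75)

print(sorted(universal))
print(sorted(persistent))
```

With these toy data only one family is found in every genome, whereas relaxing the threshold recovers three families. This mirrors, in miniature, the way a persistent set can remain informative when the strictly conserved set shrinks as genomes are added.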

The analysis of persistent genes has also
allowed an exploration of the early times of
what became extant cells. The way in which
these genes are distributed in genomes
showed that, despite the billions of years of
divergence from common ancestors, their
clustering within genomes tends to keep a
well-defined organization. This is despite
the huge catastrophic events which led
to vast extinctions of living forms,
probably because many unicellular microbes
can sustain very harsh conditions, or live
in highly protected environments such as
the bottoms of oceans. Briefly, persistent
genes behave as if they formed a network
of mutual attraction, clustering together
genes coding for related functions (Fig. 1;
modified from Ref. [22]). At the heart of the
persistent genome is a highly connected
gene network organized around genes
coding for RNA-mediated information
transfer (the ribosome and translation, and the
transcription machinery) and the general
maintenance of the cell. A second,
somewhat fragmented, network is centered on
RNA-metabolism as its organizing
principle. This involves several RNA-wielding
enzymes, tRNA synthetases, which are
assumed to belong to the most ancestral
enzymes [23, 24]. Furthermore, connected
to some of these genes are found genes
organizing cell division, a core process in
the making of progeny. Finally, a set
of genes which are no longer connected to
one another codes for synthesis of the basic
building blocks of the cell, the nucleotides
and amino acids, as well as the core
catalytic centers of proteins, the coenzymes
(in particular iron-sulfur clusters). This set
also comprises genes that allow synthesis of
the lipid bilayer that constitutes the cell’s
membrane. Viewed retrospectively, this
organization is suggestive of a scenario of
the origin of life, which might develop as
follows.
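
As an aside, the idea of a network of clustered genes can be made concrete with a short Python sketch. It assumes, hypothetically, that two gene families are linked when they are immediate neighbors in at least a chosen number of genomes; this rule, the family names, and the toy gene orders are illustrative assumptions and not the method of Ref. [25]. The connected components of the resulting graph then play the role of the sub-networks described above.

```python
from collections import defaultdict


def neighbor_edges(gene_orders, min_genomes):
    """Link two families if they are adjacent in at least `min_genomes` genomes."""
    support = defaultdict(set)
    for genome, order in gene_orders.items():
        for left, right in zip(order, order[1:]):
            if left != right:
                support[frozenset((left, right))].add(genome)
    return {pair for pair, seen_in in support.items() if len(seen_in) >= min_genomes}


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph, largest first."""
    adjacency = {node: set() for node in nodes}
    for pair in edges:
        a, b = tuple(pair)
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(component)
    return sorted(components, key=lambda c: (-len(c), sorted(c)))


# Hypothetical gene orders along three toy chromosomes.
toy_orders = {
    "genome_1": ["transl_1", "transl_2", "transl_3", "metab_1", "metab_2"],
    "genome_2": ["metab_1", "transl_1", "transl_2", "transl_3", "metab_2"],
    "genome_3": ["metab_2", "transl_3", "transl_2", "transl_1", "metab_1"],
}

families = {family for order in toy_orders.values() for family in order}
edges = neighbor_edges(toy_orders, min_genomes=3)
components = connected_components(families, edges)

for component in components:
    print(sorted(component))
```

In this toy example the three "translation" families keep the same neighbors in every genome and form one connected cluster, while the two "metabolism" families change neighbors and end up isolated, a pattern loosely analogous to the connected and unconnected gene sets described in the text.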


2.2 
Surface Chemistry 


Selection of chemical compounds from
a wealth of similar ones is an essential step in
the organization of prebiotic metabolism.
As discussed by Cairns-Smith, a primordial
“soup” such as that produced by
Urey and Miller [27-29] would contain
not only chemicals able to evolve further
into more complex compounds
but also poisons that would counteract
that evolution [10]. By contrast, mineral
surfaces (carrying an excess of positive
charges brought about by metal ions) can
select molecules from an aqueous
environment and also concentrate locally
molecules that are negatively charged
(mainly carboxylates and phosphates),
leaving other molecules diluted out in bulk
water. Subsequently, the mineral-bound
molecules come into contact and react
together, and only those that are able
to remain bound to the surface will be
trapped for further chemical evolution.
Beyond this selection of specific classes
of compounds and their further evolution,
entropy-driven processes are
crucial in this scenario because, on a
surface, they favor polymerization,
especially when it is caused by the elimination
of a water molecule (this is what usually
happens in biological polymerization)
[12, 14]. In this context, iron-sulfur clusters
[Fe-S] have been retained as prosthetic
groups in almost all living organisms
(some lactobacilli excluded) [30].
Extant [Fe-S] clusters are involved in
biologically important processes, ranging
from electron transfer catalysis to
transcriptional regulatory roles.

Extant metabolism and its steps coded
in persistent genes allow the scenario to
be substantiated by bringing to mind clues
for the first steps of a surface metabolism.
Moreover, the requirement for an overall
reproduction of metabolism stresses the
need for some autocatalytic steps (because
they provide a self-consistent means to
stabilize the synthesis of those molecules
that will be further metabolized [5, 31]).
It is worth noting that this process
requires that clusters of metabolites belong
to distinct compartments (no “primordial
soup” would allow the process to develop),
stressing again the importance of
compartmentalization [32]. In this scenario,
coenzymes (molecules that are uniformly
overlooked by investigators speculating
about the origin of life) and nucleotides
play a crucial role. The core of metabolism
is made of triose-phosphates, and energy
is derived from iron and sulfur redox
transitions, leading to the formation of a solid
which has [Fe-S] clusters at its core, pyrite
[12, 30]. Subsequently, the polymerization
of amino acids and nucleotides gives rise to
polymers which may progressively
substitute for heterogeneous mineral particles.
Among those, Cairns-Smith proposed that
RNA molecules, as polyelectrolytes that
could mimic clays, are the obvious
surrogates of surfaces [10]. In parallel with
Kornberg, Cairns-Smith also argued for
an important contribution of
polyphosphates in the early management of
energy [8].


Fig. 1 The network of paleome genes. The
proteome of bacteria coding for 1500 genes
or more has been analyzed to identify gene
persistence [20]. Subsequently, the
persistence of gene clustering has been explored
[25]. In this figure, the clustering of paleome
genes is represented, using Cytoscape [26],
as a graph of connections (blue lines). The
overall connection network can be divided
into three sets: the yellow rectangle displays
a poorly connected network, made from genes
mainly coding for intermediary metabolism;
the pink network is organized around
connections with tRNA synthetases and a small
cluster of genes involved in cell division; finally,
the blue network codes for the ribosome, the
translation, and transcription machineries. The
color of the dots labels functional properties
(e.g., black for regulation; dark blue for
macromolecule synthesis genes; red for modification
of macromolecules and related processes; and
green for intermediary metabolism). (Figure
adapted from Ref. [22].)

Amino acids and peptides are core building
blocks of a large number of essential
compounds in extant metabolism. Many
are also convincing candidates as belonging
to surface metabolism, but the synthesis
of basic amino acids, nucleotides,
coenzymes, and lipids is not straightforward
[9, 10, 12]. Translation emerged later
[32], and it can be seen how amino acids
could be involved in chemical reactions
yielding peptides (including isopeptides,
which do not have the arrangement found in
proteins) as well as more complicated
combinations. Remarkably, the extant so-called
nonribosomal peptide synthesis provides a
clue for a general process that could have
been widely running before the emergence
of the ribosome [33]. The general
organization of the catalytic cycle of these
enzymes is conserved in proteins making
complex chemical compounds (generally
polyketides, with an overall set-up which
is prone to produce a considerable number
of variant compounds [34]) and, even
more remarkably, it is also shared in
the sequence and structure of enzymes
producing long-chain fatty acids in
organisms present in all three domains of life
[35].


These enzymes permit a variety
of molecules to react together, in-
cluding D-amino acids and other
non-proteinogenic amino acids, allowing
synthesis of the final product in a stepwise 
fashion. They are organized in a more or 
less cyclical arrangement into modules 
required for loading the product in statu 
nascendi, catalyzing one single step of 
intermediary product elongation at a 
particular reactive group and modification 
of that functional group (Fig. 2). The 
number and order of modules and 
the type of domains present within a 
module on the enzyme determine the
structural variation of the final product 
by dictating the number, order, choice 
of the reacting groups (amino acids in 
peptides) to be incorporated, and the 
modification associated with a particular 
type of elongation. Besides input of the
starting substrate, two chemical reactions 
are crucial for the activity of these 
enzymes: (i) activation of the groups 
that will be successively input in the 
elongated product; and (ii) transfer to an 
active thiol of the activated group. The 
minimum set of domains required for 
an elongation cycle in peptide synthases 
consists of a module resulting in amino
acid adenylation, followed by thiolation 
and transfer to the peptidyl carrier 
protein, and finally condensation. Central 
to these reactions is the formation of 
thioesters, via a swinging arm composed
of 4'-phosphopantetheine. The synthesis of
polyketides is similar, using CoA-activated 
groups instead of adenylated amino acids. 
The same is true for fatty acid synthases.
At the core of all these enzymes is either
a peptidyl carrier protein (for nonribosomal
peptide synthesis) or an acyl carrier protein
(for polyketide synthases and fatty acid
synthases); the two are structurally related
and, remarkably, share a common
descent [36]. This core protein or protein
domain is post-translationally modified by 
the addition of a 4'-phosphopantetheine
prosthetic group [37]. This process is well 
illustrated in Mycobacterium tuberculosis, 
which harbors several enzymes of these 
families [38]. In summary, these enzymes 
produce a variety of compounds
using “variation upon a theme,” with 
thiol transfers as the common chemical 
management of energy-rich bonds. The 
role of these energy-rich bonds has been 
convincingly pointed out by Christian de 
Duve, in another argument presenting 
living organisms as still carrying active 
archives of their origins [39, 40]. 
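
The modular logic described above can be pictured with a toy model. The sketch below is only an illustration under simplifying assumptions, not a biochemical simulation: the names `Module` and `assemble`, and the example substrates, are hypothetical and do not come from the literature cited here. It shows how the number, order, and type of modules alone fix the identity of the product.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Module:
    """One elongation module (hypothetical, illustrative representation)."""
    substrate: str           # group selected and activated by this module
    modification: str = ""   # optional tailoring of that group, e.g. "epimerize"


def assemble(modules: List[Module]) -> List[str]:
    """Walk the modules in order and return the ordered units of the product."""
    chain: List[str] = []  # stands for the intermediate held on the carrier arm
    for module in modules:
        unit = module.substrate                      # (i) activation of the incoming group
        if module.modification:
            unit = f"{module.modification}({unit})"  # modification tied to this module
        chain.append(unit)                           # (ii) thiol transfer, then condensation
    return chain


if __name__ == "__main__":
    line = [Module("Phe", "epimerize"), Module("Pro"), Module("Val")]
    print("-".join(assemble(line)))  # epimerize(Phe)-Pro-Val
```

Reordering, adding, or removing entries in the list changes the product without touching the assembly routine, which is the sense in which these enzymes play a "variation upon a theme."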


2.3
The Origin of Nucleotides and the 
RNA-Metabolism World 


Metabolic studies usually focus on car- 
bon chemistry. Yet, carbon is only part of 
life’s history, as the organic molecules
present in living organisms also comprise
a very high content of nitrogen. This sit- 
uation was not thought to pose a difficult 
problem 30 years ago, when models of the 
primitive atmosphere considered it to be 
strongly reducing, and rich in NH3, a fairly
reactive molecule [28]. However, the view 
of a reducing atmosphere 3.8 billion years 
ago has rapidly been challenged, and it is 
often now considered that the atmosphere
was mostly rich in H2O, CO2, and N2 [41,
42]. This implies that any realistic model 
of the origin of life must propose scenarios 
involving prebiotic nitrogen metabolism, 
including nitrogen fixation from dinitro- 
gen, a challenging chemical process. 
Following Granick’s suggestion, it may 
be rewarding to analyze the present-day 
situation of the various processes of 
nitrogen scavenging. In keeping with 
what has been discussed previously,
these processes are strongly linked
to the existence of [Fe-S] clusters.
In particular, dinitrogen reduction
requires the presence of iron-sulfur
proteins such as ferredoxins as electron 
transfer intermediates and a rare atom, 
molybdenum, in the form of a complex 
cofactor, FeMo-co, comprising iron, 
sulfur, molybdate, and homocitrate [43]. 
However, it is now known that iron is 
the core element involved in nitrogen 
reduction [44, 45]. Furthermore, there 
exist nitrogenases which use vanadate 
instead of molybdate [46]. Ferredoxins 
are proteins constructed with a limited 
number of amino-acid types, and they 
contain an [Fe-S] cluster [29], typical of 
what could be expected for the proteins 
present very early on. Despite the apparent 
dispensability of molybdenum in models 
of the process, an important step in 
extant nitrogen fixation is the transport of 
molybdate. The latter is also associated 
with a cofactor, molybdopterin, which is 
present in enzymes of all three domains of 
life [47]. Molybdopterin is a nitrogen-rich, 
sulfur-containing coenzyme, that could 
therefore interact with metal-sulfur clus-
ters, and is composed of a pterin moiety
derived from guanosine triphosphate 
(GTP) by cyclization following the loss 
of a one-carbon formyl group. This step 
is catalyzed by a GTP cyclohydrolase, 
which yields pteridine triphosphate

Fig. 2. Outline of nonribosomal protein synthesis. Activated amino acids are bound to reactive thiols in the
domains of a multidomain protein. Starting at an initiator site, a first amino acid is transferred to the thiol of
a 4'-phosphopantetheine arm that moves it to a second site, where it reacts with a second amino acid. The
(iso)dipeptide is then moved to the 4'-phosphopantetheine arm and the process is repeated until it reaches the final
domain of the protein. Molecules that differ from amino acids are polymerized in the same way in structurally related
molecules (polyketides). The synthesis of long-chain fatty acids and other lipids is achieved in a similar fashion.


from GTP with the elimination of a 
one-carbon group (precisely, a group 
transported by pteridine-containing coen- 
zymes, consistent with the presence of 
an autocatalytic process). Interestingly, at 
least two families of these enzymes are 
distributed in all domains of life, and they 
are involved in the synthesis of several 
coenzymes [48]. As they produce formic 
acid, might it be possible to consider that 
the reverse reaction could be a model 
for the one-step synthesis of nucleotides 
without requiring the specific synthesis 
of ribose, a notoriously unstable molecule 
[49]? It could then be speculated that an 
autocatalytic process allowing the synthe-
sis of pteridine triphosphate could have
led to the synthesis of GTP. In this
frame of thought, GTP would have been 
but a side-product of on-going nitrogen 
fixation [14]. This speculation would call
for a process permitting the synthesis of
pteridine from peptide or polyketide syn- 
thesis. The exploration of extant microbial 
metabolism might provide clues as to the 
plausibility of such processes. 

A subsequent constraint required that 
nucleotides be tightly coupled to energy 
metabolism. In particular, it was essen- 
tial that some stable energy store ex- 
isted. Among minerals, polyphosphates 
are obvious candidates; indeed, these com- 
pounds embed energy in the remarkably 
metastable phosphate bond [50], which can 
be hydrolyzed, but requires a large energy 
of activation to do so, thus making avail- 
able the phosphate bond as the cardinal 
“quantum” of energy, ultimately in all bio- 
chemical reactions. As noted by Kornberg, 
polyphosphates are resistant to desicca- 
tion or to irradiation, they can sustain 
very harsh conditions and nevertheless be 
available for a variety of metabolic reac- 
tions, including regenerating high-energy 
nucleotides [8]. They are still present today
in all cell types, and play a critical role in
allowing cells to survive and produce live
progeny [51].

Once energy-rich ribonucleotides were 
available in a compartmentalized envi- 
ronment (reinforced by lipid synthesis or 
possibly vesicle-forming peptide synthe- 
sis [52]), they polymerized on surfaces 
in the presence of polyphosphates and 
peptides (some of which favor 3'-5' bond
formation over the more likely 2'-5' bonds
[53]). Oligonucleotides progressively sub-
stituted for surfaces and played the role
of a rigid holder, allowing for the lo- 
cal modification of substrates. So, might 
it be possible to find in today’s RNA 
molecules a class that could have played 
such a role? In 1975, Jeffrey Wong, when 
describing the structure of a possible uni- 
versal genetic code, proposed that the 
ancestors of transfer RNA were essen- 
tial partners of primeval metabolism [54]. 
The involvement of these RNA molecules 
in pre-translation metabolic processes is 
shown by the observation of some non- 
ribosomal synthesis of (iso)peptides by 
enzymes highly related to class I transfer 
RNA synthetases [55]. 

This role as metabolic supports is fur- 
ther illustrated by the remarkable number 
of modified bases in the tRNA molecules, 
in all three domains of life [56]. This is 
also visible in the apparently expletive 
nontranslational role of tRNA in several 
metabolic reactions [14, 29]. Besides be- 
ing substrates carrying metabolic reac- 
tions, the RNA molecules progressively 
discovered how they could bind various 
compounds in a highly specific manner 
(which is still present in today’s aptamers 
[57] or riboswitches [58]), and this allowed 
them to behave as catalysts, becoming ri- 
bozymes [59]. One crucial reaction that 
has been retained (in the ribosome) is the 
formation of peptide bonds [60]. It seems,
nevertheless, very likely that most of the
catalytic reactions were performed by pep- 
tides associated with relevant cofactors (in 
particular in reactions involving electron 
transfers) despite the fact that, in the pres- 
ence of ferrous iron and in the absence 
of oxygen, RNA may succeed in catalyzing 
electron transfers [61]. At this point, a fur- 
ther interesting property of RNA emerges: 
ribonucleotide polymers will tend to fold 
back on themselves, opening up the dis- 
covery of self-complementarity. 


2.4 
From Substrates to Templates: the 
RNA-Genome World 


Upon folding in three dimensions, RNA 
opened the law of sequence complemen- 
tarity to selective processes, and this made 
possible the faithful replication of RNA 
molecules. In parallel, the translation ma- 
chinery, which was based on RNA catalysis 
[60] but associated with proto-tRNA syn- 
thetases [24], built up progressively with 
increasing specificity. The ancestor of the 
ribosome peptidyl transfer catalytic cen- 
ter produced peptides with a more or less 
specific sequence, using tRNA ancestors 
as the holding devices. A high specificity 
of the chaining of amino acids in pep- 
tides was achieved when a template RNA 
was used to organize a time-dependent 
series of tRNA-carrying amino acids as 
substrates, using the law of complemen- 
tarity to associate a particular tRNA to 
a sequence of the template RNA in 
a progressively increasing specific code, 
which became the present genetic code 
(see Ref. [9] for a presentation of some 
early hypotheses about tRNA as carriers 
of amino acids and formation of nu- 
cleotide folds made of what would become 
codon/anticodon interactions; for example,
Refs [62–64]). Many investigations have
been devoted to accounting for the origin
of present-day tRNA and of the genetic
code (see recent examples in Refs [65–67]).
The role of the ribosome as an RNA 
nanomachine catalyzing peptidyl trans- 
fer is well documented, and the pathway 
of ribosome evolution has been explored 
in detail by Caetano-Anollés and cowork-
ers, with the conclusion that ancestral
ribosomes were likely to be related to 
those present in extant Archaea [68]. Re- 
cent studies on the origin of tRNA syn- 
thetases, possibly originally specified by 
complementary polynucleotides [23, 24], 
have further substantiated the split of 
the code into codons used to specify hy- 
drophobic amino acids (NUN codons) and 
codons specifying hydrophilic amino acids 
(NAN codons). It should be noted how- 
ever that, because the table of the genetic 
code is of limited size, the statistical val- 
idation of any hypothesis is doomed to 
failure. The consequence again is that, 
as in most topics related to the origin 
of life, it is opinion — not science — that 
prevails. Hence, it is best to favor hy-
potheses requiring the minimal number
of assumptions.
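
The split between NUN and NAN codons mentioned above can be read directly off the standard genetic code. The following sketch only tabulates which residues fall in each class; it is not a statistical test and does not bear on the caveat just stated. The helper name `residues_by_second_base` is introduced here for illustration, residues are given in one-letter code, and `*` denotes a stop codon.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def residues_by_second_base(second: str) -> set:
    """Return the residues (or '*') encoded by codons with this middle base."""
    return {aa for codon, aa in CODON_TABLE.items() if codon[1] == second}


if __name__ == "__main__":
    print("NUN:", sorted(residues_by_second_base("U")))
    print("NAN:", sorted(residues_by_second_base("A")))
```

With U in the middle position the table yields only Phe, Leu, Ile, Met, and Val, whereas A in the middle position yields Tyr, His, Gln, Asn, Lys, Asp, Glu, and two stop codons.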

Following evolution of the
“RNA-metabolism” world based on
RNA as substrates and possibly catalysts,
the “RNA-genome” world based on RNA
as templates was born. Vesicles containing 
coding RNA molecules split and fused 
repeatedly, propagating the most efficient 
metabolic systems and the associated 
coding system. The RNA-genome world, 
where RNA transcription and replication 
were still overlapping processes, explored 
a large domain of cell life based on RNA 
gene expression, probably with genes 
as isolated pieces of RNA, as in today’s 
RNA viruses (it is expected that this 
was the time when RNA viruses began 
to exist). However, these primitive cells
depended on the steady synthesis of
ribonucleotides and polyribonucleotides, 
which are chemically unstable molecules. 
Clearly, the RNA-genome world had to 
find a chemical means of stabilizing its 
coding capacity, or it would be doomed to 
disappear. 

In a context where polyphosphates were 
the ultimate energy store, polynucleotide 
synthesis and turnover were crucial.
Because this synthesis involved a 
considerable amount of energy, the 
remarkable possibility of salvaging energy 
via phosphorolysis rather than hydrolysis, 
with the discovery of polynucleotide
phosphorylase (present in all three
domains of life [69, 70]), marked a
turning point in the evolution of 
nucleotides, via the formation of a pool 
of ribonucleoside diphosphates [71]. 
Ultimately, the existence of this pool 
presented an opportunity for the discovery 
of deoxyribonucleotides, with the 
emergence of ribonucleotide reductase, 
one of the most abundant families of 
enzymes encoded in environmental 
metagenomes [72]. These enzymes 
belong to three related structural classes. 
Evolution of the dominant form, class 
I, has been affected by the invasion of 
the Earth’s atmosphere by dioxygen [73]. 
As a consequence of this activity, a new 
family of stable nucleic acids emerged, 
with DNA being considerably more stable 
than RNA. Subsequently, RNA replication 
had to evolve into two processes: (i) 
DNA replication, to generate the cell’s 
progeny; and (ii) DNA transcription into 
RNA, to express pieces which had to 
be expressed as transcripts, functioning 
either within the RNA realm (as aptamers, 
riboswitches, and ribozymes) or as 
templates for translation. 
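
The energetic point made at the start of this paragraph can be seen by writing the reactions side by side. In the notation assumed here, (NMP)_n stands for an RNA chain of n residues and P_i for inorganic phosphate; the third line gives only the overall stoichiometry of ribonucleotide reduction, with the electron donor left unspecified.

```latex
\begin{align*}
\text{hydrolysis:} \quad
  & (\mathrm{NMP})_{n} + \mathrm{H_2O} \longrightarrow (\mathrm{NMP})_{n-1} + \mathrm{NMP} \\
\text{phosphorolysis:} \quad
  & (\mathrm{NMP})_{n} + \mathrm{P_i} \rightleftharpoons (\mathrm{NMP})_{n-1} + \mathrm{NDP} \\
\text{reduction:} \quad
  & \mathrm{NDP} + 2e^{-} + 2\mathrm{H}^{+} \longrightarrow \mathrm{dNDP} + \mathrm{H_2O}
\end{align*}
```

Hydrolysis releases a nucleoside monophosphate, whereas phosphorolysis releases a nucleoside diphosphate that retains an energy-rich phosphoanhydride bond. This is the sense in which energy is salvaged, and it is this diphosphate pool on which reduction to deoxyribonucleotides can act.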

The need for genes to be collectively ex-
pressed in order to produce a metabolically
consistent cell created a selection pres-
sure for having genes clustered in a small 
number of DNA polymers. This led to the 
formation of protochromosomes, whereby 
many genes were encoded simultaneously. 
While both DNA strands could, in prin- 
ciple, be coding in overlapping sequences 
[74], in many instances the genes could not 
overlap in complementary strands while 
remaining functionally significant. On the 
other hand, the replication of large pieces 
of DNA implied that the two strands could 
not be replicated in the same structural 
way: replication could be continuous for 
the leading strand, but the lagging strand 
had to be replicated in a piecemeal fash- 
ion. Okazaki RNA fragments can be seen 
as the remaining signature of the tran- 
sition between the RNA-genome world 
and the DNA-genome world [75]. At this 
point, a need to control the accuracy of 
replication implied that cells uncovered a 
variety of proofreading systems, allowing 
them to distinguish the old template
strand from the newly replicated strand,
and to correct errors deriving from the in-
evitable rapid deamination of cytosine [76].
The emergence of thymine, as a mimic 
derivative of uracil, opened up this oppor- 
tunity. Indeed, deaminated cytosine reads 
as a U in DNA, which can be immedi- 
ately recognized and corrected back to C 
[77]. Also, during replication the newly 
synthesized strand would contain a small 
amount of uracil, which could be recog- 
nized and used as a tag, indicating its 
recent synthesis and allowing the correc- 
tion of errors in the new strand, but not in 
the old parent strand. A similar role could 
have been fulfilled before the discovery 
of thymine by the mistaken incorpora- 
tion of ribonucleotides in the place of 
deoxyribonucleotides in the nascent repli-
cated strand [78]. In this context, thymine
synthesis (which had been discovered
at least twice, with structurally different
thymidylate synthases distributed in a
fairly haphazard fashion) monitored the
spread of repeated HGT [79].
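
The advantage conferred by thymine can be made concrete with a small sketch. It is a deliberately simplified illustration: strands are plain strings, the whole repair step is collapsed into a single substitution, and the function names are hypothetical. In a thymine-based genome every uracil is unambiguous damage, whereas in a uracil-based genome a deaminated cytosine cannot be told apart from a legitimate uracil.

```python
def deaminate(strand: str, position: int) -> str:
    """Spontaneous deamination: the cytosine at `position` becomes uracil."""
    if strand[position] != "C":
        raise ValueError("only cytosine deaminates to uracil")
    return strand[:position] + "U" + strand[position + 1:]


def repair_thymine_dna(strand: str) -> str:
    """In a thymine-based genome, any U marks a former C and is restored."""
    return strand.replace("U", "C")


def ambiguous_sites_uracil_genome(strand: str) -> list:
    """In a uracil-based genome, each U may be legitimate or a damaged C."""
    return [i for i, base in enumerate(strand) if base == "U"]


if __name__ == "__main__":
    damaged = deaminate("ATGCCTA", 3)
    print(damaged, "->", repair_thymine_dna(damaged))
    print(ambiguous_sites_uracil_genome(deaminate("AUGCCUA", 3)))
```

In the first case the original sequence is recovered exactly; in the second, the list of candidate sites includes the legitimate uracils as well as the damaged position, so no rule based on the base alone can single out the error.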


3 
Origin of Extant Cells 


To summarize the preceding subsec- 
tions, prebiotic metabolism developed 
in iron-sulfur, polyphosphate-rich muds, 
presumably at a variety of locations on 
the Earth’s surface. This resulted in cy- 
cles of sustained development, followed 
by extinction on many occasions (this is a 
consequence of the reproduction features 
of metabolism [5]). The processes develop- 
ing there remained isolated for some time, 
but then came into contact with processes 
developed elsewhere, thus propagating in- 
novation from place to place. The most effi- 
cient networks of reactions, built between 
relatively autonomous metabolic clusters, 
progressively became invasive as they
became more accurate. Among the
innovations was certainly the discovery of 
lipids, which resulted in the creation of 
lipid bilayers that compartmentalized re- 
actions in such a way that energy could 
be generated via vectorial proton transport 
across an electrochemical gradient, stored, 
and used in a controlled manner within 
lipid-bordered vesicles [80]. The behavior 
of vesicles with lipid bilayers is therefore 
essential to understand the evolution of 
prebiotic life (the RNA-metabolism world) 
and protobiotic life (the RNA-genome 
world). 

Whilst vesicles may split and merge, this 
requires some form of activation energy 
and specialized structural components. 
The general process of mitochondrial and 
endoplasmic reticulum development in ex-
tant eukaryotic cells provides an idea of the
corresponding underlying constraints [81].
They may also pass through one another, 
creating a widespread process similar to 
phagocytosis (see, for example, Ref. [82] 
and citations therein, noting that these 
authors propose a view of evolution of 
primitive forms which is somewhat at odds 
with the present one). Later, the vesicles 
present in the interior of the absorbing 
vesicle could either dissolve into the host 
or specialize progressively into novel func- 
tions. Phagocytosis is so widespread that 
it seems conceivable — and, in fact, quite 
likely — that the first RNA/DNA genome 
cells were phagocytes. It can then be 
speculated that early phagocytes harbored 
two compartments: (i) vesicles contain- 
ing the outcome of the RNA-metabolism 
world, carrying the translation machin- 
ery and which became the cytoplasm; 
and (ii) the RNA/DNA-genome world, 
which became the nucleus. The two early 
RNA-based processes (the reproduction
of metabolism and replication of RNA
mineral-substrate substitutes) evolved
differently within primitive cells. These 
cells evolved as Protokarya with a pro- 
gressive organization of the genetic ma- 
terial into chromosomes, as DNA replaced 
RNA (Fig. 3). Interestingly, the analysis 
of extant genomes argues for an origin
based on cells with structured compart- 
ments [83]. Furthermore, the very exis- 
tence of phagocytosis had a remarkable 
consequence, as it created a competition 
between protocells, with some having to 
engulf competitors that were then di- 
gested. Although this process had a pos- 
itive outcome, as it spread horizontally 
metabolic innovations, in parallel it cre- 
ated as a side effect the opportunity for 
cells to escape being engulfed and di- 
gested. Thus, two widely different escape 
routes were discovered during the course 
of evolution. 


Fig. 3 A scenario of origin of extant living
species. The scenario is initiated by the onset
of a surface metabolism (top of figure), where
amino acids, nucleotides, coenzymes and
lipids emerge from a succession of metabolic
reactors organized on the surface of minerals
comprising iron, sulfur, and polyphosphates.
Once polymerized, the nucleotides form RNA
molecules which substitute for surfaces and
form the RNA-metabolism world. Self-folding
uncovers the complementarity law, which ini-
tiates the RNA-genome world. Ribose is re-
duced to deoxyribose, creating DNA, a very
stable polynucleotide which progressively sub-
stitutes for RNA, initiating the DNA-genome
world where thymine is discovered (at least
twice). Association of the RNA-metabolism
and the RNA/DNA-genome worlds creates
a population of phagocytic Protokarya, with
constant engulfment of competing cells. This
favors the selection of cells that are resistant
to phagocytosis, namely Archaea, via the for-
mation of a novel class of lipids, and Bacteria,
via the formation of a cell wall. Subsequently,
Eukarya, which maintained the phagocytosis
process of their Protokarya ancestors, incorpo-
rate specific Bacteria as symbionts, generating
mitochondria and chloroplasts, in a process
which is still ongoing.


3.1
Origin of the Archaea


Phagocytosis required that the lipid mem-
brane had subtle properties that could
only develop under highly restricted
physico-chemical conditions, in particu-
lar medium temperature, near-neutral pH,
and average ionic strength [84]. One way
to escape being absorbed was therefore to 
create novel membranes that would pre- 
vent membrane fusion or internalization. 
This was achieved via the discovery of a 
novel class of lipids, based on ether bonds 
instead of esters and highly branched iso-
prene side chains [85-87], and associated
with an altered chirality of the glycerol
backbone [88, 89], as well as a variety
of original metabolic pathways. The most
significant differences between the two types
of lipids are the nature of the ether or 
ester linkages, and the nature of the hy- 
drocarbon chains (high methyl branching 
without unsaturation in Archaea, and al- 
most straight chains with unsaturation in 
Eukarya) [90]. By displaying these prop- 
erties [91], which separated them from 
the ancestors of Eukarya, the ancestors 
of Archaea progressively escaped phago- 
cytosis and developed in extreme environ- 
ments, where their original membranes 
were better adapted than those composed 
of standard lipids, in particular when 
forming an impermeable barrier under 
harsh conditions. In line with this hy-
pothesis, it should be noted that no Archaea
are known to be pathogens. While this point is
not yet understood, it may be expected 
to be linked to the particular nature of 
the Archaea membranes, which prevents 
them from creating stable hostile inter- 
actions with Eukarya. Indeed, although 
Archaea have seldom been identified as 
symbionts of Eukarya [92, 93], they are
able to form symbiotic associations within
the Archaea domain (e.g., the Ignicoccus
hospitalis/Nanoarchaeum equitans relation-
ship [93]).

The Archaea isoprenoid-derived mem- 
brane lipids are consistent with an anoxic 
environment, in keeping with the dis- 
covery of the corresponding metabolic 
pathways during the early times of life 
evolution. Together with the observation 
that Archaea are found thriving in hot 
springs and on intercontinental ridges — a 
situation thought to have prevailed in 
the early life of the Earth—they were 
often thought to descend from organ- 
isms that were at the origin of life [94]. 
In fact, they are still proposed to be at
the origin of Eukarya [82, 95], and even
at the origin of all three domains of 
life [96]. This view appears to be based 
on a popular, but erroneous, assump- 
tion (prokaryotes are often presented as 
“simple” organisms, forgetting the diffi- 
culties associated with engineering minia- 
turization) that single-cell organisms with 
a minimum number of compartments are 
more primitive than organisms with mul- 
tiple compartments. 


3.2 
Origin of the Bacteria 


Another way to escape phagocytosis was 
to evolve a complex outside envelope, as 
witnessed in today’s phagocytosis evasion 
processes [97]. A particular structural 
set-up of the cell’s envelope may account 
for the birth of Bacteria. Among the 
many features in common between 
Archaea and Eukarya is the synthesis 
of lysine, when present [98, 99]. In this 
particular pathway, the precursor of 
lysine is the non-proteinogenic amino 
acid L-alpha-aminoadipate, which must 
be protected by acetylation or other 
means to prevent its entry into proteins 
via translation (see Ref. [100] for the 
rationale of the protection/deprotection 
process). The need for this chemical 
protection process opened up the 
opportunity to create another pathway 
that would involve metabolites less prone 
to interfere with translation. This was 
achieved when the synthesis of another 
non-proteinogenic amino acid, L,L-, then 
meso-diaminopimelate, was discovered. 
Because of its head-tail dual amino acid
structure, this metabolite could be used 
for a completely different function, as a 
crosslinking bridge between a variety of 
compounds, making them resistant to
a variety of aggressions. The presence
of a diaminopimelate synthesis pathway 
resulted in selection of the molecule 
as a core element in the complex 
peptidoglycan structure that now protects 
and shapes most bacterial envelopes. 
It was surmised that the emergence of 
such robust and rigid envelopes could be 
used to prevent phagocytosis, evolving a 
novel domain of life — that of the Bacteria. 
Subsequently, diaminopimelate could be 
used as a precursor of lysine, essentially 
in Bacteria (and some Archaea and 
plants [101], via the ubiquitous process 
of HGT) where it created a pathway that 
progressively displaced the more promis- 
cuous L-alpha-aminoadipate pathway 
[102]. Further evolution led to functional 
streamlining, with a progressive loss of 
compartments and fusion between the 
RNA-metabolism compartment and the 
RNA/DNA-genome compartment (some 
organisms such as Planctomycetes [103] 
or Parakaryon myojinensis [104] show that a 
large variety of unicellular types exist), and 
emergence of the major bacterial clades, 
monoderms (with a single cytoplasmic 
membrane) and diderms (with an inner 
membrane and an outer membrane) 
[105]. Within Bacteria, there still exist 
highly compartmentalized organisms: 
in anammox bacteria, for example, a 
single bilayer membrane divides the cell 
into three distinct cellular compartments 
that have distinct cellular functions. The 
anammoxosome, the innermost com- 
partment, manages energy metabolism, 
while the middle compartment, the 
riboplasm, contains both the nucleoid 
and ribosomes and thus has the features 
of the standard bacterial cytoplasm. 
Finally, the outermost compartment, 
the paryphoplasm, is of as yet unknown
function [106].


3.3
From Protokarya to Eukarya 


Besides its emphasis on the role of com-
partmentalization, this view is in line
with a study of the evolution of protein 
domains which suggests that the ancestors 
of prokaryotes are not made from Bacteria 
but rather are derived from ancestors of the 
Eukarya. Kurland and coworkers argued 
that, because of several drastic bottlenecks 
during the evolution of life on Earth (at 
least six major extinction events), what
tends to be recognized as the common
ancestor of all living species could simply 
be a population of survivors of one of 
these catastrophic events. As these authors 
write: 


“... the data suggest that most
of the protein elements neces-
sary for the construction of cells
of the three superkingdoms, in-
cluding eukaryote organelles were
already expressed in the bottle-
necked population that re-rooted
the phylogenetic tree following a
cataclysmic collapse of the bio-
sphere. According to our phyloge-
netic reconstructions, Bacteria and
Archaea are not identifiable as an-
cestors to eukaryotes. Instead they
diverge from a common ancestor
independently of the eukaryotes
as highly specialized, fast growing
unicellular organisms that have 
evolved efficient simplicity as the 
hallmarks of their cellular archi- 
tectures and to survive predation 
by their relatively complex eukary- 
ote cousins” [107]. 


In brief, at this point of the scenario, 
life had evolved into three domains of or- 
ganisms: phagocytic Protokarya; Archaea; 
and Bacteria. Protokarya phagocytes still
used their phagocytosis propensity to ex-
plore and exploit their environment. It is
known that modern Eukarya evolved from 
mitochondrion-bearing ancestors [108], at 
least most of them did (the highly diverse 
world of Protists still remains poorly ex- 
plored). Furthermore, many Eukarya today 
have created further endosymbiotic struc- 
tures with Bacteria, that they have engulfed 
at some point [109]. This is consistent 
with phagocytosis preceding the origin 
of the mitochondrion, which is of clear 
bacterial descent [110]. Another hypothe-
sis (that Eukarya derived from a merger
of a member of Bacteria and a member of
Archaea) is not tenable, if only because
of the question raised by the divergent na- 
ture of their phospholipids [111]. However, 
the absence of ancestrally amitochondriate 
eukaryotes (archezoa) among extant eu- 
karyotes can neither be used as evidence 
for an archaeal host for the ancestor of the 
mitochondrion, nor as evidence against a 
eukaryotic host [112], so that it is not yet 
known whether there still exists a direct 
descent line from the Protokarya (i.e., or- 
ganisms with a nucleus and no remnants 
of mitochondria). The ongoing collection 
of planktonic protozoa might help to solve 
this riddle [113]. Still active today, the 
process of endosymbiosis occurred repeat- 
edly, in particular in algae and plants, 
where members of the cyanobacteria en- 
tered mitochondria-hosting cells and gen- 
erated the chloroplasts [114]. The process 
is still ongoing, with a variety of symbionts 
in insects (in particular), and the forma- 
tion of nodules with alpha-proteobacteria 
in leguminous plants [115]. The impor- 
tance of catastrophic bottlenecks during 
evolution, as pointed out by Kurland and 
coworkers, makes it difficult to recon- 
struct the scenario of the origin of extant 
cells. In particular, as noted by Rokas 


and Carroll [116], the question of diver- 
gent versus convergent evolution must 
be faced, in a world where HGT was 
the rule rather than the exception (and 
still is the rule in the Bacteria domain 
[19]).

Finally, it must be remembered that an- 
thropomorphism ensures that Eukarya are 
commonly seen as essentially multicellu- 
lar organisms, but this point cannot be 
discussed in depth here. Communities
of cells forming a common
differentiated structure evolved both in
Bacteria (within the Myxobacteriales clade
[117]) and in amoebae such as Dictyostelium
discoideum [118]. This corresponded to the
emergence of a third functional genome, 
the histome (from iot06¢, tissue) besides 
the paleome and the cenome, which took 
into account the epigenetic regulation of 
gene expression allowing cells to display 
features specific to the different types of
tissues. 


3.4 
Between Domains: the Perpetuation of 
Horizontal Gene Transfer 


The ancestors of Archaea and Bacteria 
did not remain isolated, and because 
of the way in which they had repli- 
cated their genome since the time of 
the RNA-genome world it was to be 
expected that they would be continu- 
ously exchanging genetic material. This 
exchange was achieved either by direct 
contact using membrane fusion/fission, 
by using progressively more specialized 
structures (conjugation), or by specialized 
DNA transport systems, when DNA be- 
came abundant because of its remarkable 
chemical stability. Direct contact between 
the cells created an opportunity for genes 
to explore the functional result of jumping 


from one site to another site, under con- 
ditions when the progressive clustering 
of genes on large DNA segments became 
favored [25]. It is therefore expected that 
structures using the catalytic properties 
of RNA for inserting within or splicing 
out of RNA played a seminal role in the 
process. RNA viruses with a lipid enve- 
lope were discovered during the process. 
The discovery of DNA was subsequently 
used via an emergence of reverse transcrip- 
tase to propagate these genetic segments 
within and between genomes, where they 
still stay [119]. 

In cells with a complex envelope, two 
related processes — conjugation, followed 
by the emergence of DNA viruses — 
organized the specific transfer of genetic 
material from a donor cell to various 
inhabitants of the same niche. The pro- 
cess of conjugation, with fairly wide tar- 
gets and elaborate pili structures, was 
selected to harpoon partners and trans- 
mit genetic material either as the full 
complement of the donor genome, as 
a primitive emergence of sex, or in 
pieces after the discovery of an au- 
tonomous replication initiation process, 
with short chromosome-like structures, 
plasmids, coding for their mobilization, 
and transfer as well as variable amounts of 
the host’s genes [120]. 

The conjugation machinery progres- 
sively became more specialized, with 
the generation of a variety of mutu- 
alistic/antagonistic interactions (incom- 
patibility groups, insertion in the host 
chromosome, etc.) and the creation of pro- 
tective/sticky structures that propagated a 
minimal replication machinery, generat- 
ing the ancestors of DNA viruses [121]. 
It was seen previously that the ancestors 
of RNA viruses had originated earlier, in 
the RNA-genome world where transcrip- 
tion and replication processes overlapped 
on RNA segments coding for individual
genes, using membrane vesicles as car- 
riers in Protokarya. They subsequently 
hitch-hiked the conjugation system in Bac- 
teria to overcome the barrier of the com- 
plex envelope, and propagated efficiently 
in this way [122]. In a similar trend of 
evolution, DNA viruses that are expected 
to have initially carried over segments of 
their host genome, sometimes large seg- 
ments (a propensity that they still have 
today, even in bacteriophages where it is 
used for transduction experiments, see, 
for example Ref. [123]), became progres- 
sively more streamlined. At the end of 
the process they carried the minimum 
set of genes which allowed them, by us- 
ing the cell gene expression machinery, 
to replicate and program synthesis of a 
protection capsid that was also essential 
for binding to host target cells and to dis- 
seminate. These two structure-dependent 
processes are still overlapping, as some 
bacteriophages can replicate as plasmids, 
rather than integrate as prophages in the 
host chromosome (many bacteriophages 
do not integrate in the chromosome when 
a prophage, but are transmitted to the 
progeny as a plasmid; e.g., Ref. [124]).
Many combinations of these processes ex- 
ist, with “defective” or “satellite” viruses 
infecting cells that harbor fully functional 
viruses and using them as agents for their 
own propagation [125]. Other viruses use 
the conjugation organs as receptors, and 
it is likely that a wealth of further il- 
lustrations of the predator-prey duality 
will keep being discovered. The outcome 
of these HGT processes is prominent in 
extant genomes, and many are still ac- 
tive, with viruses and conjugation devices 
ubiquitously spread. This resulted in the 
presence of genomic islands in genomes, 
that are quite distinctive in terms of their 
nucleotide composition (they are often 
A+T-rich) from the bulk of the genome.
Many still have features showing their ori- 
gin (such as viral-specific genes found in 
prophages or proviruses in general). How- 
ever, some genomic islands such as those 
coding for polyketides or nonribosomal 
peptides are often not linked to obvious 
features, which provides a reminder of the 
way they were transferred and where they 
are now located. 
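
The compositional signature mentioned above (islands that are more A+T-rich than the bulk of the genome) can be illustrated with a simple sliding-window scan. The following is only a minimal sketch: the function names, the window size, the step, and the 10% excess threshold are illustrative assumptions and are not taken from the text, and real genomic-island detection relies on considerably more evidence than base composition alone.

```python
def at_fraction(seq: str) -> float:
    """Return the fraction of A and T bases in a DNA string (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("A") + seq.count("T")) / len(seq)


def at_rich_windows(genome: str, window: int = 1000, step: int = 500,
                    excess: float = 0.10):
    """Yield (start, end, at_fraction) for windows that are A+T-rich.

    A window is reported when its A+T fraction exceeds the genome-wide
    A+T fraction by at least `excess`. All three parameters are
    hypothetical defaults chosen for illustration only.
    """
    baseline = at_fraction(genome)
    last_start = max(len(genome) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = genome[start:start + window]
        frac = at_fraction(chunk)
        if frac - baseline >= excess:
            yield start, start + len(chunk), frac


if __name__ == "__main__":
    # Toy "genome": a G+C-rich backbone with one A+T-rich insert.
    toy = "GC" * 500 + "AT" * 100 + "GC" * 500
    for start, end, frac in at_rich_windows(toy, window=200, step=100):
        print(f"{start}-{end}: A+T fraction {frac:.2f}")
```

On a real sequence, windows flagged in this way would only be candidates; as noted above, some islands carry no obvious feature of their origin.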

As a consequence of widespread HGT, 
the genomes of unicellular organisms 
are composed of two parts: the paleome 
(as discussed earlier), and the cenome, 
which corresponds to adaptation to a 
particular niche, with a variety of coded 
processes allowing individual organisms 
to belong to a consistent population [21]. 
As noted above, the late emergence of 
multicellular organisms displays a further 
specific gene set (the histome) which
directs the epigenetic programming of cell 
differentiation. 


4 
Conclusion 


The scenario proposed here attempts
to show that compartmentalization
is a primitive trait, and thereby
differs from most other scenarios, which
usually assume that either Archaea or 
Bacteria preceded the birth of Eukarya. 
In contrast, it was proposed that Pro- 
tokarya — phagocytes which are expected 
to harbor a nucleus-like structure — created 
an opportunity for the appearance of Ar- 
chaea, with a different membrane, and 
Bacteria, with an envelope that could resist 
phagocytosis. Later, following the reduc- 
tive evolution of primitive bacterial sym- 
bionts in Protokarya, authentic Eukarya 
appeared, with a variety of protists, fungi 


and animals, followed by plants. An im- 
portant feature of this scenario, besides 
the need for compartmentalization, is that 
HGT was (and still is) pervasive. The
main consequence of this situation is that 
there cannot be a single common ancestor 
to all living organisms. The unfortunately 
frequent adamite view of a Last Common 
Ancestor is a fiction, and this can be 
visualized as follows. An extant organism 
derives from a set of functions that have 
been encoded early on, often separately. 
Thymine synthesis, for example, may de- 
rive from two widely different pathways, 
while isoprene synthesis follows either the 
mevalonate or non-mevalonate pathway; 
the same is true for lysine, which may 
result from alpha-aminoadipate, or from 
diaminopimelate. Even more importantly, 
ATP synthase has a complex origin, 
obviously submitted to HGT [126]. There 
is a tree for the small subunit of
ribosomal RNA, and this tree (which is
extended to the core proteins of information
transfer and central metabolism) is
often proposed to indicate that there is a 
unique root for extant organisms. How- 
ever, even in this case it is doubtful 
whether this is demonstrative since, even 
today, most bacteria (for example) display 
within a single genome sequences of ribo- 
somal DNA that are slightly different from 
one another [127]. Furthermore, the essen- 
tial protein-folding system that associates 
with the ribosome and controls the first 
steps of protein folding is not common 
to all three domains of life: the nascent 
polypeptide-associated complex (NAC) is 
common to Eukarya and Archaea, but 
differs in Bacteria (trigger factor). As a con- 
sequence, the origin of extant organisms 
is that of a population which underwent a 
series of expansions and bottlenecks [107], 
exactly as occurred during the course of 
many generations for the origin of Man, 


where it is impossible to claim the exis- 
tence of an Adam or an Eve. 
Keywords


Alternate splicing 

This occurs when after splicing, a single gene gives rise to more than one mRNA 
sequence. It may be due to the joining of exons in different series. Occasionally, 
HnRNA may splice differently (a portion of sequence may act as the intron in one case, 
and as the exon in another case). 


Alzheimer’s disease 

A neurodegenerative disorder that leads to the irreversible loss of neurons and dementia. 
The apparent symptoms are progressive impairment in memory, judgment, decision 
making, orientation to physical surroundings, and language. 


Attenuation 

A mechanism that controls the ability of RNA polymerase to read through an attenuator, an intrinsic
termination sequence that is present at the start of the transcription unit. This type of 
control is present in some prokaryotic operons. 
Bromodomain

A protein domain of about 110 amino acids that recognizes acetylated lysine residues,
such as those on the N-terminal tails of histones. This recognition is often a prerequisite
for protein-histone association and chromatin remodeling. It is found in a variety of
mammalian, invertebrate, and yeast DNA-binding proteins.


CAAT box 

A conserved sequence located about 75 nucleotides upstream of the start point of 
transcription units. It is found in eukaryotes, and is also known as the -75 box sequence.
It is recognized by certain transcription factors, and has the consensus sequence 
GGCAATCT. The CAAT box plays an important role in increasing the promoter 
strength. 


Chromodomain 

A protein structural domain of about 40-50 amino acid residues, commonly found 
in proteins associated with the manipulation of chromatin. The domain is highly 
conserved among both plants and animals, and is represented in a large number 
of different proteins in many genomes. Some chromodomain-containing genes have 
multiple alternative splicing isoforms. 


Coffin-Lowry syndrome

An X-linked dominant genetic disorder that causes severe mental problems. It 
is sometimes also associated with abnormalities of growth, cardiac abnormalities, 
kyphoscoliosis, as well as auditory and visual abnormalities. 


Cyclic AMP receptor protein (CRP or CAP) 

A regulatory protein activated by 3’,5’-cyclic AMP (cAMP). In prokaryotes, the 
transcription of many genes is activated after binding of this protein (in the form 
of a CRP-cAMP complex) at a specific site in the DNA. Two molecules of cyclic AMP
bind with one molecule of CRP. 


Epigenetics 
The study of changes in phenotype (appearance) or gene expression, caused by 
mechanisms other than changes in the underlying DNA sequence. 


Epigenotype 
The stable pattern of gene expression outside the actual base pair sequence of DNA. 


Epigenetic regulation 

Cells in multicellular organisms are genetically homogeneous, but structurally and 
functionally heterogeneous, due to differential expression of genes, mostly during 
development. This differential expression is subsequently retained through mitosis. 
Stable alterations of this type are termed epigenetic regulation. 
Exon
A segment of an interrupted gene that contains a coding region and is present in mature mRNA.


Gratuitous inducer 
A substance that induces the transcription of a gene(s), but is not a substrate for its 
enzyme protein product. Generally, it is an analog of the substrate, a normal inducer. 


Intron or intervening sequence 

A segment of an interrupted gene found in eukaryotes. The intron is transcribed but does
not code for a protein product. Intron sequences are removed during the maturation of
the primary transcript; this process is termed RNA splicing.


Inducer 

A small molecule that triggers the biosynthesis of RNA by binding to the cytoplasmic 
repressor (the product of a regulatory gene). It is generally the substrate of the enzyme 
protein product of the structural gene. 


Induction 

The ability of bacteria or yeast to synthesize certain enzymes only when their substrates 
are present. The inducer binds to the cytoplasmic repressor, preventing it from binding 
to the operator region. If a cytoplasmic repressor is already bound to the operator, it
becomes detached from the operator region after binding with the inducer.


Lariat 
An intermediate that is formed during RNA splicing, where a circular structure with a 
tail is formed by a 5’, 2’ bond. 


Leader sequence 
A nontranslated sequence at the 5’ end of mRNA, preceding the initiation codon. 


Myoblast 
A type of progenitor cell that gives rise to myocytes. 


Operator 
A DNA sequence to which a cytoplasmic repressor (the protein product of a regulatory 
gene) binds specifically. 


Polyadenylation 
The addition of a poly(A) sequence to the 3’ end of a eukaryotic RNA (a post-transcriptional
change).
Polycistronic mRNA
An mRNA having the information for more than one protein. It is formed after the 
transcription of more than one gene present in a cluster (operon). 


Promoter 
The region of DNA involved in the binding of RNA polymerase to start RNA biosynthesis. 


Regulatory gene 
This gene codes for an RNA or a protein that controls the expression of other genes. 


Repression 

The inhibition of enzyme biosynthesis by a product of the metabolic pathway. Generally, 
inhibition is at the level of transcription. The product of the regulatory gene (cytoplasmic 
repressor) and the product of the metabolic pathway (corepressor) complex bind to the 
operator region on the DNA. 


Rett syndrome 

A neurodevelopmental disorder classified as an autism spectrum disorder. It was first 
described by the Austrian pediatrician, Andreas Rett, in 1966. The clinical features 
include a deceleration of the rate of head growth, and small hands and feet. Repetitive 
hand movements such as mouthing or wringing are also noted. 


Riboswitch 
A riboswitch is an mRNA that senses the environment directly, shutting itself down in 
response to particular chemical cues. 


Ribozyme 
An RNA that acts as an enzyme. Some RNA molecules are capable of self-splicing without the
involvement of any protein. This type of RNA is called a ribozyme.


RITS (RNA-induced transcriptional silencing) 

A form of RNA interference by which short RNA molecules (viz., small interfering 
RNAs; siRNAs) trigger the downregulation of transcription of a particular gene or 
genomic region. RITS is generally accomplished by the post-translational modification 
of histone tails (e.g., methylation of lysine 9 of histone H3), which target the genomic 
region for heterochromatin formation. The protein complex that binds to siRNAs and 
interacts with the methylated lysine 9 residue of histone H3 is the RITS complex.


SnRNAs (small nuclear RNAs) 
These are small RNAs present in the nucleus. They are considered to be involved in 
RNA splicing/other processing reactions. 
SnRNPs
These are small nuclear ribonucleoproteins. Within the SnRNPs, the SnRNAs are 
associated with proteins. 


Splicing 
The removal of introns and joining of exons in RNA. 


TATA box 

A conserved sequence found about 25 nucleotides upstream from the start point of 
the eukaryotic RNA polymerase II transcription unit. It is considered to be involved in 
positioning RNA polymerase II for correct initiation. 


Telomerase 
An enzyme resembling reverse transcriptase. The action of telomerase is to add
telomeric repeat sequences to chromosome ends.


Telomere 

A specialized structure at the ends of linear eukaryotic chromosomes. Telomeres 
generally have many tandem copies of a short oligonucleotide sequence, $T_aG_b$ in one
strand, and $C_bA_a$ in the complementary strand (where a and b are on average 1-4).


Totipotent 

Under appropriate conditions, a single cell divides and produces all of the differentiated 
cells in an organism. These cells are termed totipotent; the phenomenon is termed 
totipotency. 


Tropomyosin 

An actin-binding protein that regulates actin mechanics. It is important for muscle 
contraction. Tropomyosin, along with the troponin complex, associates with actin in 
muscle fibers and regulates muscle contraction by regulating the binding of myosin. 


Upstream 
The sequences found at the 5’ end of and beyond the region of expression. 


Gene expression can be regulated at the stage of transcription, RNA processing
(post-transcriptional changes), and translation. In prokaryotes, switching transcription
on and off serves as the main regulatory control of gene expression whereas, in eukaryotes,
more complex mechanisms of transcriptional regulation operate. In addition,
RNA splicing also plays a major role in the regulation of gene expression. The
primary transcript of DNA has complementary sequences of both exons and
introns, and is termed heterogeneous nuclear RNA (HnRNA). The HnRNA is spliced
by the removal of introns and the ligation of exons. The regulation of gene
expression in both prokaryotes and eukaryotes is important, as it determines
whether a particular protein should be synthesized, and in what quantity. The 
cells of a multicellular organism are genetically homogeneous, but structurally and 
functionally heterogeneous, owing to the differential expression of genes. Many of 
these differences in gene expression arise during development, and are subsequently 
retained through mitosis. Stable alterations of this type are termed epigenetic. These 
alterations are heritable in the short term, but do not involve mutations of the DNA 
itself. The main molecular mechanisms that mediate epigenetic phenomena are 
DNA methylation and histone modification(s). 


1 
Introduction 


The central dogma of gene expression 
is that “DNA makes the RNA, a process 
called transcription; and RNA makes the 
protein, a process known as translation” 
[1, 2]. Whilst in prokaryotes the cells do 
not have a distinct, well-defined nucleus, 
in eukaryotes the cells have a distinct, well- 
defined nucleus. Examples of prokaryotes 
include bacteria and blue-green algae,
while eukaryotes include animals, plants, 
and fungi [3]. In prokaryotes, the RNA 
primary product may itself be the target 
of regulation, whereas in eukaryotic cells 
— because of compartmentation — the 
transport of mRNA from the nucleus to 
the cytoplasm may serve as an additional 
target for regulation. Bacterial mRNA is 
directly available for protein biosynthesis 
soon after its synthesis, while the regula- 
tion of transcription usually occurs at the 
stage of initiation. At this point, it would 
be pertinent to mention that eukaryotic 
genes have been found to have both coding 
and noncoding sequences. In fact, as per 
the recently acquired human genome se- 
quence data, more than 50% of sequences 
are noncoding, the function of which is 
unclear. These sequences are referred 
to as introns or “junk sequences,” while


the coding sequences are known as exons 
[4, 5]. Of course, regulatory elements also 
exist. 

In eukaryotes, the regulation of gene 
expression has been shown previously to 
occur mainly at the level of transcription. 
However, more recently such regulation 
has also been reported to occur signifi- 
cantly at the translational level. 

In the past, the ability to control the 
expression of genes in mammalian cells 
exogenously has served as a powerful tool 
in biomedical research [6, 7]. Indeed, gene 
regulation technology has played a key 
role in the efforts to understand the role 
of specific gene products in fundamental 
biological processes, in both normal de- 
velopment and disease states [7]. Clearly, 
an understanding of the role of gene reg- 
ulation within the context of an entire 
system, in relation to disease processes, 
would aid in the development of thera- 
peutic approaches. Similarly, a knowledge 
of gene expression and promoter control
sufficient to improve gene and protein
networks, in conjunction with a knowledge
of signal transduction, might help in
the treatment of complex diseases. Conse- 
quently, attempts have been made in the 
following sections to describe all of the 
important aspects of gene regulation. 
2
Regulation of Gene Expression in 
Prokaryotes 


2.1 
Induction and Repression 


In prokaryotes, induction and repression 
— especially in the case of enzyme proteins 
— represent the most prominent means 
of regulation at the level of transcription. 
While certain proteins are synthesized at a 
constant rate at all times, other proteins — 
especially enzymes — are often produced 
in larger amounts when certain other ma- 
terials are present. This type of material, 
which generally is the substrate of the 
enzyme protein, will enhance the synthe- 
sis of the enzyme and is referred to as an 
inducer of that enzyme. Consequently, the 
enzyme is referred to as being an inducible 
enzyme, and the whole process is known 
as enzyme induction. It is not necessary 
for the inducer to be the substrate of the 
enzyme; in fact, the inducer may simply 
resemble the enzyme’s natural substrate, 
and need not necessarily be affected by 
it. An inducer that is not the natural 
substrate of the enzyme is termed a 
gratuitous inducer. In addition, if the genes 
expressing more than one enzyme are 
arranged in a cluster, then a single inducer
may induce expression of all the genes. 
Although an inducible enzyme is 
normally present only in trace amounts 
in a bacterial cell, its concentration can 
be rapidly increased (by 1000-fold or 
more) when its substrate is present in 
the medium. This is particularly the 
case when the substrate is the sole 
carbon source of the cell since, under 
these conditions, the induced enzyme is 
required to transform the substrate into a 
metabolite that can be utilized directly by 
the cell. One well-studied example of an 


inducible enzyme is β-galactosidase from
Escherichia coli. Those E. coli cells with a
wild-type β-galactosidase gene are unable
to utilize lactose if glucose is also present
in the medium. However, if only lactose
is present as sole carbon source, or when
the utilization of glucose is complete,
the bacterial cells will synthesize the
β-galactosidase enzyme and begin to utilize
lactose within only a 1-2 min period.
Simultaneously, the cells will synthesize
the enzyme β-galactoside permease,
which is required for the transfer of
β-galactoside inside the bacterial cell, as well
as β-thiogalactoside transacetylase. Subsequently,
if the induced bacterial cells are
transferred into a medium that is deficient
in lactose, the synthesis of β-galactosidase
(along with β-galactoside permease and
β-thiogalactoside transacetylase) will
cease immediately and the previously induced
enzyme will then decline to normal
levels. The induction of a group of related
enzymes or proteins to the same extent
by a single inducing agent is termed coordinate
induction. The case just described
for E. coli, involving the induction of
β-galactosidase, β-galactoside permease
and β-thiogalactoside transacetylase, is an
excellent example of coordinate induction.

Previously, two hypotheses were pro- 
posed in an attempt to explain the mech- 
anism of enzyme induction, using the 
β-galactosidase system:


1. That an activation of pre-existing pro- 
tein occurred. 

2. That there was a de novo synthesis of 
the protein. 


The first possibility was discounted be- 
cause, prior to induction, no protein could 
be detected which had antigenic properties
similar to those of β-galactosidase.
This observation suggested that the actual 


synthesis of the enzyme protein occurred 
following the addition of an inducer. At 
the time, it was realized that a specific 
protein would not be synthesized except 
in the presence of a gene that would dictate
its primary structure (i.e., the amino acid
sequence). Based on these observations,
it was clear that the E. coli cells carried
a structural gene for the β-galactosidase
enzyme protein; hence, the question re- 
mained as to why enzyme protein syn- 
thesis did not occur in the absence of an 
inducer. Two explanations were proposed 
for this: 


1. The inducer might serve as a form 
of template that would trigger the 
enzyme protein synthesis (this may 
have been why the inducer resembled 
the substrate). 

2. The synthesis of the enzyme pro- 
tein may be inhibited by an unknown 
agent(s). In this case, the inducer itself 
might act as a form of inhibitor, which 
would in turn block the activity of the 
inhibiting agent(s). Although, initially, 
this possibility seemed very complex, 
subsequently obtained evidence sup- 
ported this hypothesis, and today this 
sequence of events is known to be 
correct. 


Similarly, in the presence of a material 
produced by a particular enzyme reaction, 
the synthesis of an enzyme protein may 
be reduced. This phenomenon is referred 
to as enzyme repression, and the material is 
known as a corepressor [1]. 

Induction and repression are two 
complementary phenomena. Generally, 
biosynthetic pathways have been found 
to be under the control of repression (e.g., 
the biosynthesis of many amino acids). 
For example, if histidine is added to the 
E. coli growth medium, all of the enzymes 
involved in its biosynthesis will no longer
be produced as the cells do not need to 
synthesize histidine. Consequently, in the 
presence of histidine, all of the enzymes 
required for its biosynthesis, starting
from ATP phosphoribosyl transferase
(which catalyzes the first reaction in the
histidine biosynthetic pathway, namely
the biosynthesis of phosphoribosyl ATP
from phosphoribosyl pyrophosphate
(PRPP) and ATP), are repressed. This
repression of the synthesis of a group of 
enzymes by a single corepressor — known 
as coordinate repression — is generally 
caused by the end product of the biosyn- 
thetic pathway (for this reason it may also 
be referred to as end-product repression). 
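
The complementary logic of induction and repression described above can be summarized in a few lines of code. This is a deliberately simplified sketch and not a quantitative model: the function names are hypothetical, the operon is treated as a simple on/off switch, and other layers of control (for example, the effect of glucose on lactose utilization) are ignored.

```python
def inducible_operon_on(inducer_present: bool) -> bool:
    """Induction: the free repressor binds the operator and blocks
    transcription; the inducer-repressor complex cannot bind."""
    repressor_binds_operator = not inducer_present
    return not repressor_binds_operator


def repressible_operon_on(corepressor_present: bool) -> bool:
    """Repression: the repressor binds the operator only when it is
    complexed with the corepressor (typically the pathway end product)."""
    repressor_binds_operator = corepressor_present
    return not repressor_binds_operator


if __name__ == "__main__":
    # Catabolic example: the lac enzymes are made when lactose is supplied.
    for lactose in (False, True):
        print(f"lactose={lactose}: lac enzymes made = "
              f"{inducible_operon_on(lactose)}")
    # Biosynthetic example: the histidine enzymes are shut off by histidine.
    for histidine in (False, True):
        print(f"histidine={histidine}: his enzymes made = "
              f"{repressible_operon_on(histidine)}")
```

The two functions are mirror images of one another, which is the sense in which induction and repression are complementary: in both cases the small molecule decides whether the repressor can occupy the operator.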


2.2
The Operon 


The concept of the operon was first pro- 
posed in 1961 by Jacob and Monod. The 
suggestion was that genes encoding proteins
with related functions (e.g.,
consecutive enzymes in a pathway)
may be organized into a cluster that, in
turn, would be transcribed into a polycistronic
mRNA under the control of a single operator.
The control of this operator would then
allow the expression of the entire set of
structural genes in the operon to be regulated.
This unit of regulation, which contained
the structural genes, regulator gene(s), and
the cis-acting elements, was referred to as 
the operon. Furthermore, the genes could 
be classified into two groups, depending 
on their coded protein functions: 


• Structural genes, the protein products of
which are directly involved in metabolic 
activity (when they act as enzymes), or 
they may serve as the constituents of an 
organelle. 
• Regulatory genes, the protein products of
which regulate the transcription of the 
structural genes. 


The activity of an operon is controlled 
by the regulatory gene(s), the protein 
product(s) of which interact(s) with the 
control elements. Although many oper- 
ons have been examined in detail, one of 
the best-studied examples is the lac (lac- 
tose) operon of E. coli. In this operon, 
lac i is the regulatory gene, the protein 
product of which (known as a cytoplas- 
mic repressor) is involved solely in regu- 
lation, whereas lac Z, lac Y, and lac A 
are structural genes that code for the enzymes
β-galactosidase, β-galactoside permease,
and β-thiogalactoside transacetylase,
respectively. Details of the lac
operon are provided in the following 
subsections. 


2.2.1. The Lactose Operon (lac Operon) 

If E. coli cells are grown in a medium
containing lactose as a sole carbon and
energy source, the cells will synthesize the
enzymes β-galactosidase (which catalyzes
the hydrolysis of lactose into glucose and
galactose), β-galactoside permease (which
is involved in the entry of lactose into
the bacterial cell), and β-thiogalactoside
transacetylase (which catalyzes the transfer
of an acetyl group from acetyl CoA onto
the 6-position of β-thiogalactoside, to generate
6-acetyl-β-thiogalactoside). Whilst
glucose is easily metabolized and enters
the glycolytic pathway directly, galactose
must first be converted into glucose before
such entry can be made. Studies with
the lac operon have shown it to consist of
three adjacent structural genes, lac Z, lac Y,
and lac A; preceding these genes are lac
O, lac P, and lac i, which have regulatory
roles. Typically, lac i codes for a protein that
serves as a cytoplasmic repressor, lac P is


the promoter site onto which RNA poly- 
merase binds, and lac O is the controlling 
site onto which the cytoplasmic repressor 
binds. Following binding, the transcrip- 
tion of the structural genes is switched 
off (Fig. 1). All three structural genes 
are transcribed as a single polycistronic 
messenger RNA carrying genetic information
for the three enzyme proteins. Besides
lactose, many of its analogs, including
isopropyl-β-thiogalactoside (IPTG),
methyl-β-thiogalactoside, and melibiose,
also act as inducers. In vitro, IPTG is the
most commonly used inducer of the lac 
operon [8]. 

The genes located in the lac operon are: 


• Gene Z+ (lac Z+): in mutant condition,
this gene results in a loss of the ability
to synthesize active β-galactosidase,
either in the presence or absence of an
inducer.

• Gene Y+ (lac Y+): in mutant condition,
this gene results in a loss of the
ability to synthesize active β-galactoside
permease, either in the presence or
absence of inducer.

• Gene A+ (lac A+): in mutant condition,
this gene results in a loss of the ability
to synthesize active β-thiogalactoside
transacetylase, either in the presence
or absence of inducer.

• Gene i+ (lac i+): this gene causes
changes in the influence of the
inducer on the synthesis of
β-galactosidase, β-galactoside permease,
and β-thiogalactoside transacetylase.


Many lac i mutants synthesize large
amounts of the enzymes in the absence
of an inducer. By using different
combinations of wild-type and mutant
alleles of the lac Z, lac Y, and lac i genes,
different types of genetic structure of the
lac region of the E. coli chromosome may
be assumed (Table 1).
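
The phenotypes expected for such combinations follow from two rules stated above: an enzyme can only be made if its structural gene is functional, and the operon is only transcribed if an inducer is present or no functional repressor exists. A minimal sketch of this reasoning is given below. The function names are hypothetical, and the model assumes that every lac i mutant simply lacks a functional repressor; other classes of regulatory mutant, and any effect of glucose, are not represented.

```python
def enzyme_made(structural_ok: bool, repressor_ok: bool, inducer: bool) -> bool:
    """Return True if an active enzyme is expected.

    The operon is transcribed when an inducer is present or when no
    functional repressor exists; the enzyme is active only if its own
    structural gene is wild-type.
    """
    transcribed = inducer or not repressor_ok
    return structural_ok and transcribed


def phenotype(z_ok: bool, y_ok: bool, i_ok: bool) -> dict:
    """Predict (beta-galactosidase, permease) without and with inducer."""
    return {
        "noninduced": (enzyme_made(z_ok, i_ok, False),
                       enzyme_made(y_ok, i_ok, False)),
        "induced": (enzyme_made(z_ok, i_ok, True),
                    enzyme_made(y_ok, i_ok, True)),
    }


def sign(flag: bool) -> str:
    """Format a boolean as '+' or '-' for printing."""
    return "+" if flag else "-"


if __name__ == "__main__":
    # (Z, Y, i) alleles; True = wild-type (+), False = mutant (-).
    genotypes = [(z, y, i) for i in (True, False)
                 for y in (True, False) for z in (True, False)]
    print("Z Y i | noninduced (gal, perm) | induced (gal, perm)")
    for z, y, i in genotypes:
        p = phenotype(z, y, i)
        print(sign(z), sign(y), sign(i), "|",
              sign(p["noninduced"][0]), sign(p["noninduced"][1]), "|",
              sign(p["induced"][0]), sign(p["induced"][1]))
```

Under these assumptions, strains carrying a wild-type lac i make the enzymes only when induced, whereas lac i mutants make whichever enzymes their structural genes allow, with or without inducer.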
Fig. 1 Line diagram of the lac operon.
P, promoter; O, operator; lac i, regulatory
gene; lac Z, Y, and A, structural genes for
β-galactosidase, β-galactoside permease, and
β-thiogalactoside transacetylase, respectively.
RNA polymerase binds on the promoter site,
while the cytoplasmic repressor binds on the
operator site and represses transcription of
the structural genes (negative control). In the
presence of the inducer, there is formation
of an inducer-cytoplasmic repressor complex,
which is not capable of binding on the
operator site. If the cytoplasmic repressor is
already bound to the operator site, it becomes
detached from there after forming a complex
with the inducer, and this results in transcription
of the structural genes.


Regulatory gene(s) need not necessarily
be located close to the structural
genes; rather, the regulatory action(s) is
(are) due to the biosynthesis of various
intracellular substance(s). The study of
mutations of the regulatory genes has
provided insights into the main mechanism
of induction and repression. The
E. coli cells containing lac i+ produce
β-galactosidase only in the presence of an
inducer, whereas cells containing the mutated
lac i (referred to as lac i-) can produce
β-galactosidase both in the presence and
absence of an inducer. These findings indicate
that lac i+ is dominant, while lac i-
is recessive.

The regulatory gene, lac i+, codes for
a cytoplasmic repressor (also called lac
repressor). The lac repressor was first
isolated by Walter Gilbert and Benno Müller-
Hill in 1966, and is a tetrameric pro- 
tein with four identical subunits each of 
molecular weight ca. 37 000 Da that binds 
specifically to the operator region. Each 
of the subunits is formed by a chain of 
347 amino acids, in which the N-terminal 
amino acid is methionine and the C- 
terminal amino acid is glutamine. The 
tetrameric protein may be dissociated in 
the presence of sodium dodecyl sulfate. 
Each of the subunits has one binding site 
for the inducer; the subunits are also able 
to bind to an inducer, but not to the opera- 
tor region. The binding constant for IPTG 
has been calculated as approximately 10~° 
M. The cytoplasmic repressor binds very 
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Tab. 1 


Various genotypes of the E. coli lactose system. 


Genotype Noninducible 


Inducible 


B-Galactosidase 


B-Galactoside 


B-Galactosidase B-Galactoside 


permease permease 
ZY Yi<cil - - 4 re 
Ze<—-Yi<il _ _ _ + 
Zi<Ye<il - _ 4 _ 
Ze < Ye<ill _ _ _ _ 
ZI YI<ie + + + + 
Ze <—Y |< ie — + - + 
Z I< Ye— <ie + - + - 


Ze < Ye < ie = 


tightly to the operator region, the equilib- 
rium constant of the complex being about 
10-13 M; the rate constants for associa- 
tion and dissociation are 7 x 10? M7! s~! 
and 6 x 10-*s~!, respectively. Although 
the content of the various amino acids is 
normal, the tryptophan content is com- 
paratively low, with only two tryptophanyl 
residues among the 347 amino acids of 
each subunit. The OD 9! ™8™* is 0.59 
(a comparatively low value due to the low 
tryptophan content) [9]. 
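
As a rough consistency check on the kinetic values for repressor-operator binding (association about 7 × 10⁹ M⁻¹ s⁻¹, dissociation about 6 × 10⁻⁴ s⁻¹), the short Python sketch below computes the ratio of the two rate constants. It assumes a simple one-step binding model, which is an assumption made here for illustration only; the function and variable names are hypothetical.

```python
import math

# Rate constants quoted in the text for repressor-operator binding.
K_ON = 7e9     # association rate constant, M^-1 s^-1
K_OFF = 6e-4   # dissociation rate constant, s^-1


def dissociation_constant(k_on: float, k_off: float) -> float:
    """Kd = k_off / k_on for a one-step binding model (assumed here)."""
    return k_off / k_on


def half_life(k_off: float) -> float:
    """Half-life of the complex, ln(2) / k_off, for first-order dissociation."""
    return math.log(2) / k_off


kd = dissociation_constant(K_ON, K_OFF)
t_half = half_life(K_OFF)
print(f"Kd is about {kd:.1e} M")
print(f"Complex half-life is about {t_half / 60:.0f} min")
```

Under this assumption the ratio comes out close to 10⁻¹³ M, in line with the equilibrium constant quoted for the complex.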

By employing techniques of circular
dichroism (CD) and optical rotatory dispersion,
estimates have been made of
the α-helix content (ca. 33-40%) and
β-structure (18-42%) of the subunits. Based
on the Chou-Fasman model and the primary
sequence, predictions of 37% α-helix
content and 35% β-structure have been
made for the subunits.

Electron microscopy (after negative staining) and powder X-ray
diffraction (XRD) analysis indicated the tetramer to have an asymmetric
dumbbell shape, with dimensions of about 45 × 60 Å, and with four tetramers
being contained in one unit cell of 91 × 117 Å. Subsequent powder XRD
analyses indicated the third unit cell dimension (which is not seen in
electron micrographs) to be 140 Å. Four tetramers could be easily packed into
this cell, in a manner which accounted for the stain distribution observed
on electron microscopy. The molecule was shown to extend the full length of
the 140 Å cell, giving the tetramer an elongated shape with molecular
dimensions of approximately 140 × 60 × 45 Å. On the basis of these data, a
model was proposed in which the subunits are related by 222 symmetry and
placed at the corners of a rectangular plane. This suggests the existence of
two operator binding sites per tetramer, if the repressor were to maintain
perfect 222 symmetry [9].

The shape of the lac repressor in solution 
appears quite different, however, with the 
tetramer appearing as a square structure 
with dimensions of approximately 105 ×
95 Å. Moreover, neither the shape nor the
dimensions of the molecule were changed 
in the presence of IPTG. Although the 
subunits could be distinguished within the 
tetrameric structure, the poor resolution 
of the system meant that no decision 


could be made on the geometry of their 
arrangement. 

By using X-ray crystallography, the lac 
repressor has been shown to consist of 
three distinct regions: (i) a core region 
which binds allolactose; (ii) a tetrameriza- 
tion region which joins four monomers 
in an α-helix bundle; and (iii) a DNA-
binding region having a helix-turn-helix
structural motif that binds the operator
site. The tetrameric lac repressor may be 
viewed as two dimers, each of which is 
capable of binding to a single lac operator. 
In turn, the two subunits each bind to a 
slightly separated major groove region of 
the operator. 

It would appear that two different types 
of binding site should be present in the 
tetrameric structure — the first for the low- 
molecular-weight effectors, and the second 
for the lac operator. There are indications 
that both inducers and anti-inducers bind 
to the same site, or at least to overlap-
ping sites; o-nitrophenyl β-D-fucoside has
been found to act as anti-inducer for the 
lac operon. Similarly, the operator bind- 
ing site involves the same region of the 
tetrameric repressor that binds nonoper- 
ator DNA. Each repressor subunit seems 
to have an effector binding site, besides 
contributing some interactions with opera- 
tor DNA. Based on experimental evidence, 
it has been concluded that the effector 
binding site and operator binding site 
are distinct and nonoverlapping. Likewise, 
based on the results of limited trypsin digestion
studies (which remove 59 N-terminal
residues and 20 C-terminal residues of
each subunit, leaving a tetrameric trypsin-resistant
core protein composed of subunits
comprising residues 60 to 327), it has
been shown that the N- and C-terminal
residues are required neither for the
binding of an inducer nor for folding
of the subunits into a correct tetrameric
conformation. However, it could not be
confirmed whether the interaction be- 
tween the subunits of the core was iden- 
tical to the interaction of the subunits in 
the native repressor. Whilst there are indi- 
cations that terminal regions are involved 
in operator binding, there are also indi- 
cations that effector binding changes the 
affinity of the repressor for its operator. 
Both the N- and C-terminal regions have
hydroxyl group containing amino acids 
(threonyl residues at positions 5, 19, 34, 
315, 316, 321, and 323; seryl residues at 
positions 16, 21, 28, 31, 309, 325, 332, 341, 
and 345) which could contribute their hy- 
droxyl groups of the side chains to form 
hydrogen bonds with specific groups of 
the bases in the lac operator. Among 
a total of eight tyrosyl residues in the 
chain, there were four in the 59-residue 
N-terminal region (at positions 7, 12, 17, 
and 47), which offers the possibility of in- 
teractions with DNA either by providing 
additional hydroxyl groups or by intercala- 
tion between the bases. In addition, eight 
positively charged amino acids have been 
found in the 59-residue N-terminal region, 
and six positively charged amino acids in 
the 36-residue C-terminal region. Thus, a 
higher content of positively charged amino 
acids is present in the terminal regions 
(14.7%) compared to the total percent- 
age in the molecule (10.7%). It has been 
considered that a combination of the N- 
and C-terminal regions constitutes a basic 
region which, through electrostatic inter- 
actions, makes contact with the negatively 
charged DNA and contributes to its spe- 
cific binding to the operator region. The 
region between amino acyl residues 215 and 324
has few charged residues (only
three each of positively and negatively charged
residues), and is enriched in hydrophobic
residues; hence, it may be involved in the
stabilization of both the native tetramer
and its tryptic core. This region may serve
as a hydrophobic nucleus that is resistant
to trypsin attack [9].
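
The percentage of positively charged residues quoted for the terminal regions can be reproduced from the counts given in the text (eight in the 59-residue N-terminal region and six in the 36-residue C-terminal region). The following Python lines are only an arithmetic check; the names are hypothetical.

```python
# Counts quoted in the text for the terminal regions of one repressor subunit.
N_TERMINAL_LENGTH, N_TERMINAL_POSITIVE = 59, 8
C_TERMINAL_LENGTH, C_TERMINAL_POSITIVE = 36, 6


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


terminal_percent = percent(
    N_TERMINAL_POSITIVE + C_TERMINAL_POSITIVE,
    N_TERMINAL_LENGTH + C_TERMINAL_LENGTH,
)
print(f"{terminal_percent:.1f}% positively charged residues in the termini")
```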

The lac repressor can initiate four types 
of interaction: 


1. Specific interaction between the lac 
repressor and its operator. 

2. Nonspecific interaction between the lac 
repressor and any DNA. 

3. Specific interaction between the lac 
repressor and its low-molecular-weight 
effectors, which include inducers and 
anti-inducers. 

4. The effector (inducer) may also interact 
with the lac repressor that is already 
bound to the operator, to form an 
intermediate ternary complex. 


One molecule of the inducer IPTG was 
found to be sufficient to release the lac 
repressor from its specific operator. Af- 
ter binding of the inducer, an almost 
1000-fold decrease was apparent in the 
affinity of the repressor for its operator. 
The lac repressor shows a single emission 
maximum at 338 nm in the fluorescence 
spectrum, which is characteristic of tryp- 
tophan. However, on the addition of an 
inducer at saturating concentrations, a 
shift in the emission maximum to 330 
nm occurs, but with no change in the 
peak shape or fluorescence intensity. The 
change in emission maximum suggests 
that at least one tryptophanyl residue per 
subunit has become less accessible to 
the solvent upon inducer binding. The 
absence of any change in the shape of 
the fluorescence spectrum may indicate 
that either both tryptophanyl residues of 
the subunit are similarly affected, or that 
only one contributes to the change in the 
emission maximum. At least two sequen- 
tial steps appear to be involved in the 
binding of IPTG to the lac repressor: (i) 


a bimolecular step, which is much slower 
than would be expected for a diffusion- 
controlled reaction; and (ii) a monomolec- 
ular step, which may be attributed to 
a conformational change in the protein. 
Subsequent CD studies showed that no 
major changes had occurred in the overall 
geometry of the peptide backbone of the 
repressor upon binding of the inducer, 
while sedimentation coefficient studies 
indicated a compactness in the protein 
molecule upon inducer binding. Glycerol 
perturbation spectra indicated that fewer 
aromatic residues are available to the sol- 
vent in the presence of the inducer than 
in the protein alone, or in the presence of 
anti-inducers. It has been predicted that 
the repressor undergoes a conformational 
change upon binding to its operator, and a 
major change in induction may be taking 
place upon interaction of the inducer with 
the repressor—operator complex. 

When, subsequently, the lac repressor 
became available in pure form, it was em- 
ployed to isolate the operator region. For 
this, DNA with a lac region was first frag- 
mented into units each of almost 1000 
nucleotides; the lac repressor was then 
added to the fragmented mixture and, af- 
ter incubation, the reaction mixture was 
filtered through a cellulose nitrate mem- 
brane. Those DNA fragments without the 
bound lac repressor passed through the 
filter, whereas those with the bound lac 
repressor remained tightly bound to the 
filter membrane. The bound DNA was 
released from the filter by adding IPTG, 
after which the lac repressor was added 
and the fragments treated with deoxyri- 
bonuclease (DNase). The operator region 
was protected against digestion by DNase 
after binding the lac repressor, whereupon 
a nucleotide sequence determination re- 
vealed that the repressor had protected 
a total of 27 nucleotides. Moreover, this 


sequence has a dyad symmetry that is im- 
portant for the specific binding of the lac 
repressor with its operator. The symmet- 
rical sequence of the lac operator region is 
as follows: 


5′-TGGAATTGTGAGCGGATAACAATT-3′
3′-ACCTTAACACTCGCCTATTGTTAA-5′
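
The dyad symmetry of this sequence can be examined with a few lines of Python. The sketch below checks that the two printed strands are complementary and counts the positions at which a stretch of the top strand agrees with its own reverse complement. The 21-nucleotide window used here is chosen for illustration and is an assumption, not a boundary given in the text; all function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

TOP_STRAND = "TGGAATTGTGAGCGGATAACAATT"      # written 5' to 3'
BOTTOM_STRAND = "ACCTTAACACTCGCCTATTGTTAA"   # written 3' to 5'


def complement(seq: str) -> str:
    """Base-by-base complement, keeping the written order."""
    return "".join(COMPLEMENT[base] for base in seq)


def reverse_complement(seq: str) -> str:
    """Complement read in the opposite direction."""
    return complement(seq)[::-1]


def dyad_matches(seq: str) -> int:
    """Count positions at which seq agrees with its own reverse complement."""
    return sum(a == b for a, b in zip(seq, reverse_complement(seq)))


strands_pair = complement(TOP_STRAND) == BOTTOM_STRAND
window = TOP_STRAND[3:]  # 21-nucleotide window chosen here for illustration
print("Strands are complementary:", strands_pair)
print(f"{dyad_matches(window)} of {len(window)} positions are symmetric")
```

A perfect palindrome would match at every position of an even-length window; the operator matches at most, but not all, positions, so its symmetry is extensive but imperfect.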


It has also been shown that allolactose 
may serve as an inducer for the lac operon. 
As β-galactosidase is able to convert lac-
tose into allolactose, the genes necessary
for this conversion are under control of 
the lac promoter. It has been shown that, 
if the number of repressor molecules in 
a bacterium is sufficiently low, a small 
proportion of the cells will have insuffi- 
cient cytoplasmic repressor to inhibit the 
transcription. Consequently, with time, an 
increasing number of cells in the culture 
will (transiently) have no lac cytoplasmic 
repressor and will express the lac operon 
such that, under these conditions, lactose 
will be converted into allolactose. The latter 
material will then bind to the cytoplasmic 
repressor, resulting in an increase in the 
expression of the genes of the lac operon. 
Moreover, this induced state is epigenetic 
and somewhat heritable [1, 9, 10]. 


2.2.2. The Histidine Operon 

The histidine operon is an example of enzyme repression. In Salmonella, most
of the structural genes encoding the enzymes required for histidine
biosynthesis are arranged in the same order as the sequence of chemical
reactions catalyzed by the respective enzymes, except in one or two cases.
The sequence of the genes in the histidine operon is as follows, where O
denotes its operator:


E I F A H B C D G O


In total, there are nine genes that specify the structure of nine proteins
involved in histidine biosynthesis. The biosynthesis of histidine, starting
from PRPP and ATP, along with the genes coding for the enzymes involved, is
shown in Fig. 2. Two enzymes in the pathway have been found to be
bifunctional. As noted above, with the exception of one or two genes, the
arrangement of genes on the chromosome is related to their position in the
pathway in vivo. This indicates that the chromosome contains a remarkable
amount of information, not only of the sequence of the amino acids in the
enzyme proteins but also regarding the metabolic pathway catalyzed by these
enzymes [1].

Fig. 2. The histidine biosynthetic pathway. The capital letters on the arrows
denote the genes that encode the enzyme catalyzing the biochemical steps.
a - ATP phosphoribosyl transferase
b - Pyrophosphohydrolase
c - Phosphoribosyl adenosine monophosphate cyclohydrolase
d - Phosphoribosyl formimino 5-amino-imidazole-4-carboxamide ribonucleotide isomerase
e - Glutamine amido transferase
f - Imidazole glycerol-3-phosphate dehydratase
g - L-Histidinol phosphate amino transferase
h - Histidinol phosphate phosphatase
i - Histidinol dehydrogenase


2.2.3. The Tryptophan Operon

The tryptophan operon consists of five structural genes that code for the
enzymes involved in the biosynthesis of tryptophan. This pathway, in addition
to enzyme repression, is also controlled by feedback inhibition, with
tryptophan inhibiting the activity of the first enzyme that is unique to the
tryptophan biosynthetic pathway. The tryptophan operon is, moreover,
controlled by attenuation; a line diagram of the tryptophan operon, showing
the structural genes, regulatory genes and other regulatory elements, is
shown in Fig. 3. The regulatory gene, which is referred to as trp R, produces
the cytoplasmic repressor, a dimeric protein with two identical subunits each
of 107 amino acids and a molecular weight of approximately 12 500 Da. As in
all cases of enzyme repression, in the absence of tryptophan (as corepressor)
the cytoplasmic repressor does not bind to the operator region. In the
presence of tryptophan, a cytoplasmic repressor-corepressor complex is formed
that binds to the operator region which, in turn, partially overlaps the
promoter region. The points contacted by the cytoplasmic repressor lie
symmetrically and occupy the region from positions −23 to −3. The operator
has a region of dyad symmetry, which also includes the consensus sequence of
the promoter at −10. As a result, RNA polymerase is incapable of binding to
the promoter region, and transcription of the structural genes is repressed.
It is clear that the different needs of induction and repression are
accomplished in a similar manner, the difference being that the effector
molecule modulates the operator binding specificity of the cytoplasmic
repressor in a different way [1, 3, 11].

In the case of the tryptophan operon, a deprivation of tryptophan results in
an approximately 70-fold increase in the frequency of initiation events at
the tryptophan promoter. Moreover, even under repressing conditions
transcription of the structural genes remains at a low level. In the case of
the lac operon, the basal level of synthesis is only about one-thousandth of
the induced level. This indicates that the efficiency of repression in the
tryptophan operon is much lower than that seen in the lac operon [1].

Fig. 3. Line diagram of the tryptophan operon. trp R, regulatory gene; P,
promoter; O, operator. Between O and trp E is the leader sequence used in
attenuation control; trp E codes for anthranilate synthase component I, and
trp D for component II. Components I and II, on combination, form active
anthranilate synthase. trp C codes for N-(5′-phosphoribosyl)-anthranilate
isomerase-indole-3-glycerol phosphate synthase; trp B codes for the β subunit
of tryptophan synthase; and trp A for the α subunit of tryptophan synthase.
The α2β2 complex forms active tryptophan synthase.


2.2.4. The Arabinose Operon (ara Operon)

The arabinose (ara) operon in E. coli consists of three structural genes that
code for the enzymes involved in the utilization of arabinose (the bacterium
can utilize arabinose as a carbon source). The ara operon is an example of
both positive and negative control; a line diagram of the operon, showing the
structural genes, regulatory gene, and regulatory elements, is shown in
Fig. 4. The product of the ara C (regulatory) gene is referred to as Ara C
protein, the biosynthesis of which is self-regulated: the protein binds to
the ara O1 operator and represses transcription of the ara C gene. In
general, the cell contains about 40 copies of the Ara C protein, and it acts
as a positive and negative regulator for transcription of the structural
genes, ara B, ara A, and ara D, which in turn code for L-ribulose kinase,
L-arabinose isomerase, and L-ribulose-5-phosphate epimerase, respectively.
Some regulatory DNA sequences exert their effect from a distance; these
sequences are not always contiguous with the promoters, with distant DNA
sequences being made closer via DNA looping mediated by specific
protein-protein and protein-DNA interactions [3].

When glucose is present and arabinose absent, the Ara C protein binds to both
ara O2 and ara I, forming a DNA loop of about 210 nucleotides. Under these
conditions, transcription of the structural genes is repressed. In contrast,
if glucose is absent and arabinose present, then cyclic AMP (cAMP) and cyclic
AMP receptor protein (CRP) become abundant, such that a complex of cAMP and
CRP binds to its site adjacent to ara I. Arabinose then binds with the Ara C
protein, altering its conformation; this binding causes the DNA loop to be
opened, while the Ara C protein bound to ara I acts as activator and, in
concurrence with the cAMP-CRP complex, induces transcription of the
structural genes. Finally, if arabinose and glucose are both present, then a
repression of transcription will occur, possibly due to catabolite repression
caused by glucose [1, 3].

[Figure 4 labels: CRP binding site; araO2, araO1, araI; araC; araB, araA,
araD; PC; PBAD; ara BAD mRNA; L-ribulose kinase, L-arabinose isomerase,
L-ribulose-5-P epimerase. Pathway: L-Arabinose → L-Ribulose →
L-Ribulose-5-PO4 → D-Xylulose-5-PO4]

Fig. 4 Line diagram of the arabinose operon. ara C, regulatory gene; ara O2,
ara O1 and ara I, regulatory elements to which the ara C gene product may
bind; PC, promoter for ara C gene; PBAD, promoter for BAD genes; ara B, ara A
and ara D, structural genes for L-ribulose kinase, L-arabinose isomerase, and
L-ribulose-5-phosphate epimerase, respectively. The ara C protein regulates
its own synthesis after binding on ara O1, resulting in a repression of
transcription of the ara C gene. The ara C protein acts as positive as well
as negative regulator for ara BAD genes. If arabinose is absent and glucose
present, ara C protein binds with ara O2 as well as ara I to form a DNA loop,
and there is repression of the ara BAD genes. If arabinose is present,
cAMP-CRP becomes abundant and binds to the site adjacent to the ara I site
(CRP-binding site). Arabinose also binds to the ara C protein, altering its
conformation, and the DNA loop is opened; the ara C protein bound on ara I
then acts as an activator for the transcription of the ara BAD genes.
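
The qualitative outcomes described for the ara BAD genes can be summarized as a small truth table. The Python sketch below is a deliberately simplified Boolean model, not a quantitative description of the operon; the function name is hypothetical, and the case in which neither sugar is present is not stated in the text and is assumed here to remain repressed because the DNA loop is not opened without arabinose.

```python
from itertools import product


def ara_bad_transcribed(glucose: bool, arabinose: bool) -> bool:
    """Qualitative state of the ara BAD genes for a given sugar supply.

    Arabinose opens the Ara C-mediated DNA loop, and the absence of glucose
    allows the cAMP-CRP complex to bind next to ara I. Both conditions are
    required for transcription in this simplified model.
    """
    loop_open = arabinose          # arabinose-bound Ara C releases the loop
    camp_crp_bound = not glucose   # glucose causes catabolite repression
    return loop_open and camp_crp_bound


for glucose, arabinose in product([True, False], repeat=2):
    state = "transcribed" if ara_bad_transcribed(glucose, arabinose) else "repressed"
    print(f"glucose={glucose!s:5} arabinose={arabinose!s:5} -> {state}")
```

Only the combination of arabinose present and glucose absent yields transcription in this sketch, which matches the three cases described in the text.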


2.3 
Positive and Negative Control 


Positive and negative control systems can 
be distinguished on the basis of the mode 
of action of the cytoplasmic repressor. 
Genes under negative control are un-
able to be transcribed in the presence of
the product of the regulatory gene (cyto- 
plasmic repressor), but will be transcribed 
in its absence. This indicates that the cy- 
toplasmic repressor switches off the tran- 
scription, either by binding to the DNA 
to prevent RNA polymerase from initi- 
ating transcription, or by binding to the 
mRNA to prevent a ribosome from ini- 
tiating translation. In fact, such negative 
control provides a fail-safe mechanism. 
The lac operon, as described above, repre- 
sents an example of negative control. 

The tryptophan operon described above 
represents another example of negative 
control, as the level of tryptophan in the cell 
regulates both the activity and generation 
of the tryptophan-synthesizing enzymes. 
Moreover, as tryptophan inhibits the ac- 
tivity of the first enzyme of the synthetic 
pathway, it will also inhibit the synthe- 
sis of further tryptophan. Tryptophan may 
also act as corepressor that activates the 
product of the trp R gene. In the presence 
of tryptophan, the tryptophan operon is
repressed by binding of the complex of
tryptophan with the cytoplasmic repressor
(the product of the trp R gene) to the operator region.

Genes under positive control are expressed only when an active regulatory
protein is present. This regulatory protein acts to switch on transcription,
and is thus an activator protein; such activator proteins are also referred
to as positive control factors. The regulatory protein interacts with DNA and
with RNA polymerase to assist initiation. A positive control factor that
responds to a small molecule is known as an activator. In such cases the
activator alone cannot bind to the operon; rather, it requires another
molecule to be bound to the activator protein, which in turn increases its
DNA-binding ability. An example of this is cAMP-activated CRP, which
activates the arabinose operon, itself an example of both negative and
positive control [1, 11, 12].
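
The distinction between the two modes of control can be stated compactly in code. The following Python sketch is a minimal Boolean illustration under the simplifying assumption that each regulator is either fully active or fully inactive; all function names are hypothetical.

```python
def negative_control_expressed(repressor_active: bool) -> bool:
    """Under negative control, genes are transcribed unless an active repressor is present."""
    return not repressor_active


def positive_control_expressed(activator_active: bool) -> bool:
    """Under positive control, genes are transcribed only when an active activator is present."""
    return activator_active


def lac_repressor_active(inducer_present: bool) -> bool:
    """Induction: the inducer inactivates the lac repressor."""
    return not inducer_present


def trp_repressor_active(corepressor_present: bool) -> bool:
    """Repression: tryptophan (corepressor) activates the trp repressor."""
    return corepressor_present


print("lac, inducer present:", negative_control_expressed(lac_repressor_active(True)))
print("trp, tryptophan present:", negative_control_expressed(trp_repressor_active(True)))
print("CRP with cAMP bound:", positive_control_expressed(True))
```

Both the lac and trp operons use negative control in this sketch; they differ only in whether the small molecule inactivates or activates the repressor.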


2.4 
Attenuation: The Leader Sequence 


In the regulation of amino acid operons, it is generally the end product
(amino acid) that acts as a corepressor to repress transcription of the
structural genes. On the basis of the mechanism of enzyme repression, it was
considered originally that a regulatory gene-deleted operon, or an operon
having a mutant regulatory gene, should not be under transcriptional
repression. However, in already derepressed trp R− mutants, tryptophan
synthesis can be stimulated by the deprivation of tryptophan, and also by an
internal deletion of the region between the operator and the first structural
gene. Based on these findings, a second mechanism of regulation that involved
a variable, premature termination of transcription in this region was
elucidated; this process was termed attenuation. On analysis of the early
mRNA of the tryptophan operon, a part of the sequence was found to code for a
short leader peptide, with any variation in translation of the leader peptide
being dependent on the supply of tryptophan. The latter material influences
the frequency of termination of transcription at the attenuator site, which
lies further downstream.

The process of attenuation controls the 
ability of the RNA polymerase to read 
through an attenuator — an intrinsic ter- 
minator located at the beginning of a 
transcription unit. The common feature 
of attenuator systems from different oper- 
ons is that some external event controls 
the formation of the hairpin required for 
intrinsic termination. Typically, if the hair- 
pin is allowed to form, then termination 
will prevent RNA polymerase transcribing 
the structural genes. However, if the hair- 
pin is prevented from forming, then RNA 
polymerase will elongate through the ter- 
minator such that the structural genes are 
expressed. Control by attenuation requires 
a precise timing of the events that con- 
trol termination. For example, translation 
of the leader peptide must occur at ex- 
actly the same time that RNA polymerase 
approaches the terminator site. The RNA 
polymerase will then remain paused un- 
til translation of the leader peptide occurs 
on the ribosome. Subsequently, the RNA 
polymerase is released and moves toward 
the attenuation site. By providing a mechanism
to sense an inadequate
supply of Trp-tRNA, attenuation is able
to respond directly to the needs of the
cell for tryptophan in protein biosynthesis [1].
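
The branching logic of attenuation can be sketched in a few lines of Python. This is a qualitative illustration only: the function name and the keys of the returned dictionary are hypothetical, and the region numbers 1 to 4 follow the scheme used for the trp leader in this section.

```python
def attenuation_outcome(trp_trna_abundant: bool) -> dict:
    """Which leader hairpin forms, and whether transcription reads through."""
    if trp_trna_abundant:
        # The ribosome translates through the Trp codons and covers region 2,
        # leaving region 3 free to pair with region 4 (terminator hairpin).
        hairpin = (3, 4)
    else:
        # The ribosome stalls on the Trp codons in region 1, so region 2 pairs
        # with region 3 and region 4 stays single-stranded.
        hairpin = (2, 3)
    terminates = hairpin == (3, 4)
    return {
        "hairpin": hairpin,
        "terminates": terminates,
        "structural_genes_transcribed": not terminates,
    }


for abundant in (True, False):
    print("Trp-tRNA abundant:", abundant, "->", attenuation_outcome(abundant))
```

The sketch ignores the timing requirement described above (the pause of RNA polymerase until translation begins) and treats the outcome as all-or-none, whereas attenuation in the cell changes the frequency of termination.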

In the case of the tryptophan operon, the attenuator lies within the
transcribed leader sequence of 162 nucleotides that precedes the initiation
codon of trp E. It has a rho-independent termination site that acts as a
barrier to transcription, in which a short GC-rich palindromic sequence is
followed by eight successive U residues. RNA polymerase terminates at this
site, producing only a 140-nucleotide mRNA. The leader region sequence
contains a ribosome-binding site, and has an open reading frame coding for a
peptide of 14 amino acids called the leader peptide, which is unstable and
has the following sequence:


Met-Lys-Ala-Ile-Phe-Val-Leu-Lys-Gly-Trp-Trp-Arg-Thr-Ser

It is clear from the sequence of the leader peptide that, among the 14 amino
acids present, two are tryptophan. As tryptophan is considered to be a rare
amino acid in proteins, its abundance in the leader peptide has a certain
significance. For example, when the amount of tryptophan in the cell is
deficient, the biosynthesis of the leader peptide on the ribosome will be
stopped when the trp codons are reached. The sequence of the mRNA suggests
that this "ribosome stalling" may in turn influence termination at the
attenuator. Pairing of regions 3 and 4 generates the hairpin that precedes
the oligo U sequence, which is a termination signal for transcription. The
position of the ribosome can determine which structure is formed; for
example, when tryptophan is deficient in the cell the ribosomes will stall at
the trp codons, which form part of region 1. Consequently, region 1 will be
sequestered within the ribosome and cannot base-pair with region 2. Under
these conditions region 2 will base-pair with region 3, thus compelling
region 4 to remain in single-stranded form. In the absence of a terminator
hairpin, RNA polymerase will continue transcription after the attenuator.
When tryptophan is present in the cell, the biosynthesis of the leader
peptide occurs through the trp codons and continues along the leader section
of the mRNA to the UGA codon, which lies between regions 1 and 2. Under these
conditions, the ribosomes will extend over region 2, preventing it from
base-pairing with region 3. At this point, region 3 remains available to
base-pair with region 4, thus generating a hairpin that results in a
termination of transcription at the attenuator (Fig. 5) [1, 3].

Fig. 5 Ribosomal stalling. (a) When tryptophan levels are low, biosynthesis
of the leader peptide on the ribosome becomes paused at the trp codons in
region 1. The region 1 sequence is then sequestered within the ribosome and
cannot base-pair with region 2; therefore, region 2 base-pairs with region 3,
compelling region 4 to remain in a single-stranded form; (b) When tryptophan
levels are high, biosynthesis of the leader peptide occurs through trp
codons. Synthesis continues along the leader section of the mRNA to the UGA
codon present between regions 1 and 2. The ribosome extends over region 2 and
prevents it from base-pairing with region 3; region 3 then base-pairs with
region 4, generating a termination hairpin.

Regulation via an attenuation mechanism has been identified in many amino
acid operons of E. coli, for example, His, Phe, Leu, Thr, and ilv.


2.5
Catabolite Repression


Glucose is the most easily utilizable sugar for energy purposes, and is
therefore preferred by E. coli as a carbon source. If the bacterial cells are
grown in a medium containing both glucose and lactose, there is no induction
of the lac operon. However, when the utilization of glucose is complete, the
cells will begin to utilize lactose such that the lac operon will be induced
to initiate the biosynthesis of β-galactosidase. In the presence of glucose,
the lactose operon will not be induced; in this case, the inhibitory molecule
is not glucose itself but rather is an unknown catabolite that is derived
from glucose and functions by preventing the expression of several operons,
including those of lactose, galactose, and arabinose. Collectively, this
effect is referred to as catabolite repression or carbon catabolite
repression.

Catabolite repression is generally mediated by several mechanisms which can
either affect the synthesis of catabolic enzymes via global or specific
regulators, or inhibit the uptake of a carbon source and result in a decline
of the corresponding inducer. The phosphoenolpyruvate (PEP):carbohydrate
phosphotransferase system (PTS) and protein phosphorylation play a major role
in catabolite repression. The PTS components form a protein phosphorylation
cascade which uses PEP as the phosphoryl donor. Most of the PTS-mediated
catabolite repression mechanisms respond to the phosphorylation level of a
PTS protein that is controlled by the metabolic state of the cell.

In E. coli, a component of the PTS, enzyme IIA (EIIA), plays an important
role in this mechanism. In E. coli, EIIA is specific for glucose transport
(EIIAGlc) such that, when glucose levels are high inside the bacterial cell,
the enzyme is present mostly in its nonphosphorylated form, and this leads to
an inhibition of adenylyl cyclase. In contrast, recently acquired genetic
data have suggested that the adenylyl cyclase enzyme is stimulated by
phosphorylated EIIAGlc. Indeed, a direct correlation has been observed
between the levels of phosphorylated EIIAGlc and secreted cAMP. However,
other evidence has indicated that an additional factor is required for the
phosphorylated EIIA-mediated stimulation of cAMP secretion.

The nonphosphorylated EIIA interacts with proteins of several non-PTS
sugar-transport systems (e.g., lactose permease and maltose permease), and
inhibits their activities, which prevents the transport of lactose into the
bacterial cell. In the Firmicutes, it is the histidine protein (HPr) that
exerts this role, with HPr being phosphorylated not only at His15 in a
PEP-dependent reaction but also at Ser46 in an ATP-requiring reaction.
Notably, the HPr exists in four different forms, all of which exert different
regulatory functions.
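
The chain of events described for E. coli, from glucose supply through the phosphorylation state of EIIA to cAMP levels and lactose uptake, can be written as a small Boolean model. The Python sketch below is a strong simplification made for illustration (all quantities are treated as on or off, and the additional factor mentioned above is ignored); the function name and dictionary keys are hypothetical.

```python
def lac_state(glucose: bool, lactose: bool) -> dict:
    """Qualitative state of the lac system for a given sugar supply."""
    # With glucose available, EIIA is mostly nonphosphorylated.
    eiia_phosphorylated = not glucose
    # Phosphorylated EIIA stimulates adenylyl cyclase, raising cAMP.
    camp_high = eiia_phosphorylated
    # Nonphosphorylated EIIA inhibits lactose permease, so lactose stays outside.
    permease_inhibited = not eiia_phosphorylated
    lactose_imported = lactose and not permease_inhibited
    # Induction needs both the inducer inside the cell and cAMP for CRP.
    lac_operon_induced = lactose_imported and camp_high
    return {
        "eiia_phosphorylated": eiia_phosphorylated,
        "camp_high": camp_high,
        "lactose_imported": lactose_imported,
        "lac_operon_induced": lac_operon_induced,
    }


for glucose in (True, False):
    for lactose in (True, False):
        induced = lac_state(glucose, lactose)["lac_operon_induced"]
        print(f"glucose={glucose!s:5} lactose={lactose!s:5} -> induced={induced}")
```

In this sketch the lac operon is induced only when lactose is present and glucose is absent, which reproduces the behaviour of cells grown on a mixture of the two sugars.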

Whereas catabolite repression has been 
studied extensively only in the Enterobac- 
teriaceae and Firmicutes, evidence exists 
in certain other pathogens of a relation- 
ship between carbon metabolism and vir- 
ulence. The mechanisms that are operative 
in carbon catabolite repression appear also 
to control virulence gene regulators, cell 
adhesion, and pili formation. Indeed, vari- 
ous studies have shown that the expression 
of the pilT and pilD genes of Clostridium 
perfringens, and of the multiple gene regu- 
lator (mga) gene of Streptococcus pyogenes, 
all of which encode a virulence regula- 
tor, are controlled by catabolite-controlled 
protein A (CcpA) [1, 13]. 


2.6 
Cyclic AMP Receptor Protein 


Cyclic AMP plays an important role 
in controlling the catabolic activity of 
both prokaryotic and eukaryotic cells. In 
prokaryotes, cAMP modulates transcrip- 
tion through CRP (also known as CAP), 
whereas in eukaryotes cAMP modulates 
the enzyme activity via covalent modu- 
lation through cAMP-dependent protein 
kinases. 

Typically, CRP does not bind to DNA 
without the prior binding of cAMP. 
Among the genes that are activated in 
bacteria in response to an increase in 
cAMP are those that encode the enzymes 
for the catabolism of lactose, arabinose, 
galactose, and maltose. The presence of
cAMP is necessary for the activation of
transcription in bacteria, a situation that
has been demonstrated by mutating the
gene coding for adenylyl cyclase (cya),
which converts ATP into cAMP. If cAMP is
added externally to such a system, then an
activation of the transcription will occur.
Promoters of the operons whose expression
depends on cAMP and CRP contain specific
sites for binding the cAMP-CRP complex.
The in vitro transcription of DNA fragments
containing cAMP-CRP-dependent promoters is
also activated by cAMP-CRP. In some cases,
mutant promoters have been isolated at
which cAMP-CRP is unable to bind; in
this case, cAMP-CRP fails to activate
transcription, both in intact cells and in
vitro.

More recently, it has been shown that
one CRP dimer (after binding two cAMP
molecules) binds at the specific site in
the operon where transcription is
activated by cAMP-CRP. Aided by the results
of CRP protection experiments to monitor
“chewing” by DNase, it was shown that
approximately 25 base pairs are protected
by cAMP-CRP against chemical attack, and
that the mutations which prevent CRP
binding are located within these
sequences. The results of the experiments
also indicated that CRP forms major
contacts in two successive grooves of the
DNA, with the most conserved sequence
to bind CRP being 5'-TGTGA-3'. Other
evidence has indicated that the 5'-TGTGA-3'
sequence is critical for CRP binding. Point
mutations that are known to prevent sta- 
ble CRP binding are located at the gal and 
lac sites whilst, at the ara site, the results 
of deletion experiments highlighted the 
importance of this sequence for CRP bind- 
ing. Another sequence 6 bp downstream 
of the TGTGA motif had an inverted re- 
peat although, in many cases, this was not 
an exactly inverted repeat sequence. Irre- 
spective of the symmetry of the sequence, 
this second motif has also been shown 
necessary for efficient CRP binding. The
results of many types of experiment have
indicated that the two subunits of CRP
recognize two zones of sequences, separated
by 6 bp. As noted above, the first of these
zones contains the sequence 5'-TGTGA-3',
while the second zone contains either a
symmetrically arranged version of the
sequence or another type of sequence.
However, the affinity of CRP for DNA appears
to be greater when the sequence 6 bp
downstream of 5'-TGTGA-3' is symmetrical
rather than non-symmetrical. The distance
between the transcription start point and
the CRP binding site is different for the
various promoters. For some promoters,
such as those for the ara operon and the
mal operon, an additional protein (the
AraC protein or the MalT protein,
respectively) is required to activate
transcription; these activator proteins
also bind to the promoter. In some cases
(e.g., lac, cat) two CRP binding sites have
been found, with the secondary sites
binding CRP less tightly and assisting in
the quest for CRP at the primary sites
[14, 15].
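
The two-zone arrangement described above (a 5'-TGTGA-3' motif, a 6 bp spacer, and a second zone that binds best when it is the inverted repeat of the first) can be illustrated with a short Python sketch. The function names, the simple mismatch count, and the example sequences are illustrative assumptions and do not come from the cited studies; real site prediction would normally rely on weight matrices built from many known sites.

```python
# Toy scan for CRP-like sites: TGTGA, a 6 bp spacer, then a second 5 bp zone.
# The second zone is compared with TCACA, the inverted repeat of TGTGA.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
MOTIF = "TGTGA"
SPACER = 6  # bp between the two zones, as stated in the text


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_crp_like_sites(dna):
    """Return (position, second_zone, mismatches) for every TGTGA hit.

    mismatches counts how many of the 5 positions in the second zone
    differ from the perfectly symmetrical sequence (0 = exact inverted repeat).
    """
    dna = dna.upper()
    ideal = reverse_complement(MOTIF)
    hits = []
    start = dna.find(MOTIF)
    while start != -1:
        zone_start = start + len(MOTIF) + SPACER
        zone = dna[zone_start:zone_start + len(MOTIF)]
        if len(zone) == len(MOTIF):  # skip hits too close to the end
            mismatches = sum(a != b for a, b in zip(zone, ideal))
            hits.append((start, zone, mismatches))
        start = dna.find(MOTIF, start + 1)
    return hits


# Hypothetical test sequences: one symmetrical site, one non-symmetrical site.
symmetrical = find_crp_like_sites("AAATGTGATCTAGATCACATTT")
non_symmetrical = find_crp_like_sites("AAATGTGATCTAGAGGGGGTTT")
```

In this sketch a mismatch count of zero corresponds to the symmetrical case, for which the text reports the higher CRP affinity; larger counts correspond to the "another type of sequence" case.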

Although the main function of CRP is
to activate transcription, in some cases the
binding of CRP has been shown to repress
transcription. Two promoters, P1 and P2,
are located at the gal promoter. The
binding of CRP at P1 causes the transcription
to be activated, whereas CRP binding at
P2 causes it to be repressed. This situation
occurs because, at P2, the CRP binds close
to the -35 region and blocks the binding
of RNA polymerase at the P2 promoter.
CRP also acts as a repressor of
transcription of its own promoter in vitro.
It also inhibits transcription of the gene
for the major outer membrane protein, OmpA,
again by binding close to the -35 region
of the promoter [14].

CRP is a dimeric protein with two iden- 
tical subunits, each containing 210 amino 
acids, the complete sequence of which 
has been deduced from the nucleotide 
sequence of the gene. The results of equi- 
librium dialysis studies have indicated that 
two molecules of cAMP can bind per CRP 
dimer, while CRP has a two-domain struc- 
ture, as confirmed by the high-resolution 
crystal structure of CRP when complexed 
with cAMP. A large N-terminal domain 
that extends from residue 1 to 135 is 
separated by a cleft from a smaller C- 
terminal domain (CTD) that extends from 
residue 136 to 210. The N-terminal do- 
main of each subunit contains one cAMP 
molecule buried in the interior of the pro- 
tein, while residues from both subunits 
are involved in the binding of cAMP. Typ- 
ically, the 6-amino group of the adenine 
ring in cAMP interacts with Thr127 on 
one subunit, and with Ser128 on the other 
subunit. 

The N-terminal domain of CRP in the re- 
gion of residues 30-89 exhibits sequence 
homology with the regulatory subunit of 
the protein kinase of eukaryotes. The reg- 
ulatory subunit of protein kinase also has 
two cAMP-binding sites. 

The CTDs of the two CRP subunits
consist of three α-helices connected by
short β-sheet structures. On each subunit,
one of the α-helices protrudes from the
surface of the CRP dimer; these two
α-helices are considered to be involved in
DNA binding. Other DNA-binding proteins,
such as the Cro and cI proteins, also
have α-helices, but these are located at the
N-terminal region. All of these DNA-binding
proteins have been shown to have a
helix-turn-helix domain that is essential
for interactions with DNA. However, the
E. coli fnr protein, which is essential for
anaerobic respiratory metabolism, also
has a helix-turn-helix domain in the
C-terminal region. Additional homology
is also found in the N-terminal regions
of the two proteins. Although the fnr
protein does not bind to cAMP, it has
a function somewhat similar to that of CRP,
serving as a pleiotropic activator for a
series of genes that are turned on under
limiting aerobic conditions. Subsequent
sequence comparisons indicated that the
fnr gene might have been derived as a
result of duplication either from the CRP
gene itself, or from a common ancestor.

Interactions between RNA polymerase
and promoters may be described as a
two-step event. In the first step, the enzyme
binds to the promoter to form a closed
complex; this binding is reversible and
characterized by an association constant,
K_B. In the second step, the closed complex
isomerizes to give rise to an open complex;
this isomerization includes a localized
unwinding of the DNA over a distance of
approximately 12 bp near the transcription
start, is generally irreversible, and is
slow, with a corresponding rate constant,
k_2. Strong promoters have high values of
both K_B and k_2, whereas weak promoters
have low values for both constants. The
addition of cAMP and CRP has two effects on
the lac promoter: (i) it enhances the rate of
open complex formation by increasing the
value of K_B without affecting k_2; and (ii)
the presence of cAMP-CRP increases the
binding of RNA polymerase on the P1
promoter. This latter increase is due to
an inhibition of RNA polymerase binding
on other secondary sites in the presence
of cAMP-CRP. The structure of the
CRP-DNA complex is interesting in that
the DNA has a bend. Proteins may
distort the double-helical structure of DNA
when they bind, and several regulatory
proteins may induce a bend in the axis.
Consequently, a dramatic change occurs
in the organization of the DNA double
helix following the binding of CRP [14].
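
The two-step scheme for open complex formation can be made concrete with a small numerical sketch. It uses the standard steady-state relation for the average lag time of open complex formation, tau = 1/k_2 + 1/(k_2 * K_B * [R]), where K_B is the association constant of the first step, k_2 the isomerization rate constant of the second step, and [R] the RNA polymerase concentration. This relation is brought in as an assumption (it is not stated in the text), and all numerical values below are hypothetical, chosen only to show the trend.

```python
def lag_time(K_B, k_2, rnap):
    """Average time (s) needed to form an open complex in the two-step model.

    K_B  : association constant for closed complex formation (1/M)
    k_2  : rate constant for closed -> open isomerization (1/s)
    rnap : free RNA polymerase concentration (M)
    """
    return 1.0 / k_2 + 1.0 / (k_2 * K_B * rnap)


# Hypothetical parameter values, for illustration only.
k_2 = 0.05     # 1/s, left unchanged by cAMP-CRP in this sketch
rnap = 50e-9   # 50 nM RNA polymerase

without_crp = lag_time(K_B=1e7, k_2=k_2, rnap=rnap)
with_crp = lag_time(K_B=1e8, k_2=k_2, rnap=rnap)  # K_B raised 10-fold

print(f"lag without cAMP-CRP: {without_crp:.1f} s")
print(f"lag with cAMP-CRP:    {with_crp:.1f} s")
```

With these assumed numbers, raising K_B alone shortens the lag considerably, while the limiting value at saturating polymerase (1/k_2) stays the same. This is the sense in which an activator can speed up open complex formation without altering the isomerization step.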


2.7 
Guanosine-5’-Diphosphate,3’-Diphosphate 


The relA gene, which is required for the
synthesis of guanosine-5'-diphosphate,
3'-diphosphate (ppGpp), has been shown to
enhance the transcription of the lacZ and
glg genes (the glg genes code for glycogen
biosynthetic enzymes). It has been
indicated that ppGpp interacts directly with
RNA polymerase to alter the transcription
of various genes and, indeed, a small
protein has been shown to mediate the
effect of ppGpp on the lacZ gene under
certain conditions. Nitrogen regulatory
proteins C and A (NtrC and NtrA) have
also been shown to activate the gln
promoter. The ntrC and ntrA genes encode
a specific DNA-binding protein and an
alternate sigma factor for RNA polymerase,
respectively. However, neither NtrC nor
NtrA increased the synthesis of glycogen
biosynthetic enzymes [16-19].


2.8 
Riboswitch 


Previously, various research groups were 
of the opinion that the regulation of gene 
activity in response to environmental cues 
was mediated only by proteins. In fact, in 
the classical model of gene regulation, the 
cells monitor their environment through a 
variety of specialized sensor proteins that 
are deployed either on their surfaces or 
internally. Today, riboswitches have been 
demonstrated as mRNAs that sense the 
environment directly and shut themselves 
down in response to particular chemical 
cues. Recently, it was shown that bacterial
genes for enzymes which direct the
synthesis of vitamin B12 employ a riboswitch.
In this case, the mRNAs transcribed from
these genes were shown to fold into a
specialized shape, creating a binding pocket
for coenzyme B12. Following B12 binding,
the mRNA would alter its shape in such
a way as to mask a nearby sequence that
otherwise would instruct the ribosomes to
start reading at that point. Consequently,
when coenzyme B12 is abundant these
sequences are hidden and the enzymes
for B12 synthesis are no longer produced.
Many other riboswitches have been re- 
ported in bacteria, including those that 
control the synthesis of vitamins B1 and
B2, and guanine nucleotides. There is
also some evidence that riboswitches are 
present in plants and fungi [20, 21]. 


2.9 
Regulon 


Regulons are also referred to as multigene
systems or global regulatory systems. In
contrast to operons, the coordinately
regulated genes of a regulon are located
physically at different parts of the
chromosome, and are controlled by their own
promoters, but are regulated by the same 
mechanisms. One of the most well-known 
examples of a regulon is the production of 
heat shock proteins (Hsps) in E. coli which, 
as a mesophile, exhibits normal growth 
at between 20 and 37°C. The bacterium 
responds to an abrupt increase in tempera- 
ture, from 30 to 42 °C, by producing a set of 
almost 30 different proteins, termed collec- 
tively Hsps; in fact, when the temperature 
is raised from 30 to 42°C, Hsp produc- 
tion by E. coli is increased almost 10-fold 
within a 5 min period. Subsequently, the 
Hsp level decreases slightly to a steady- 
state level, which is maintained while the 
cells remain at the elevated temperature. 
If the temperature is then decreased from 
42 to 30°C, the levels of Hsps decrease 
abruptly almost 10-fold within the same, 
5-min period. In addition to a change in 
temperature, however, other agents (viz. 
organic solvents or other DNA-damaging 
agents) can induce heat shock gene ex- 
pression. Clearly, the heat shock regulon 
deals with the variety of cellular damage 
that may occur in many different ways [22]. 

Many of the Hsp genes encode either 
proteases or chaperones. The proteases 
degrade any abnormal proteins, including 
incompletely synthesized and misfolded 
proteins, whereas the chaperones bind 
to abnormal proteins, causing them to 
unfold and then attempt to re-fold into 
an active configuration. Although the 
genes encoding Hsps are scattered around 
the chromosome, they are coordinately 
regulated and therefore they constitute a 
regulon. The regulator of the heat shock
response is an alternative sigma factor,
named sigma 32 (σ32); this protein has a
molecular weight of 32 kDa, is involved
in the initiation of the transcription of
heat shock genes by recognizing the heat
shock promoters, and is coded by a gene
known as rpoH (RNA polymerase subunit
heat shock). The heat shock promoters
have different -10 (CCCCAT) and -35
(CTTGAAA) consensus sequences than do
promoters that are recognized by the normal
sigma factor with a molecular weight of
70 kDa. Typically, σ32 is unstable at low
temperatures, with a half-life of about
1 min, but is almost fivefold more stable
at a higher temperature. Regulation of the
rpoH gene occurs at the translational level;
indeed, although a significant amount of
rpoH mRNA may be detected in cells at low
temperatures, it is not translated. At high
temperatures, the inhibition of translation
is relieved and the synthesis of σ32 occurs.
Previously, it has been shown that two
regions of the rpoH mRNA are required
for translational inhibition: one region
close to the +1 site, and another between
+150 and +250 in the mRNA. These two
regions form a stem-loop structure, which
may prevent binding at the ribosome
binding site and thus inhibit translation,
as well as possibly increasing the stability
of this mRNA [22].
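
The practical consequence of the stability figures quoted above (a half-life of about 1 min for sigma 32 at low temperature, and roughly fivefold greater stability at high temperature) can be worked through with a few lines of Python. The sketch assumes simple first-order decay of an existing protein pool with no new synthesis; that kinetic model is an assumption made here for illustration, since the text gives only the half-lives.

```python
def fraction_remaining(minutes, half_life_min):
    """Fraction of a protein pool left after first-order decay."""
    return 0.5 ** (minutes / half_life_min)


LOW_TEMP_HALF_LIFE = 1.0                        # min, as quoted in the text
HIGH_TEMP_HALF_LIFE = 5.0 * LOW_TEMP_HALF_LIFE  # "almost fivefold more stable"

after_5_min_low = fraction_remaining(5, LOW_TEMP_HALF_LIFE)
after_5_min_high = fraction_remaining(5, HIGH_TEMP_HALF_LIFE)

print(f"left after 5 min at low temperature:  {after_5_min_low:.3f}")
print(f"left after 5 min at high temperature: {after_5_min_high:.3f}")
```

Under these assumptions only about 3% of the factor survives 5 min at the low temperature, whereas half of it survives at the high temperature, which is consistent with the rapid switching of the heat shock response described in this section.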

The σ32 protein is degraded by a specific
protease, termed HflB. The degradation
of σ32 at 30°C also requires a chaperone
composed of three proteins, termed DnaK,
DnaJ, and GrpE. The degradation of σ32
at 30°C is decreased almost 10-fold by
mutations in any one of the genes that
code for the HflB, DnaK, DnaJ, and GrpE
proteins.

Evidence is available that the
interactions between DnaK and σ32 are
temperature-dependent and occur only
at low temperatures. At higher
temperatures, σ32 is capable of interacting
with RNA polymerase, but is unable to
interact with DnaK. It has been assumed
that this temperature-dependent
interaction between DnaK and σ32 brings
stability to σ32, although when the
temperature falls from 42 to 30°C a
translational inhibition of mRNA returns
and σ32 again becomes sensitive to
degradation. Such temperature-sensitive
properties allow the heat shock response to
be turned on and off very quickly.

A second heat-induced regulon is
controlled by another sigma factor,
known as sigma E (σE). The σE-controlled
promoters are much more active at
about 50°C; in fact, deletions of the gene
encoding σE have been shown to be
temperature-sensitive at 42°C, whereas
deletions of the gene encoding σ32 are
temperature-sensitive at 20°C. The σE
promoter responds to misfolded outer
membrane proteins, whereas σ32
responds to misfolded cytoplasmic proteins.
The σ32 gene also has a σE promoter,
so that all of the Hsps are induced by a
cascade effect when σE-regulated genes
are expressed. Typically, the σE regulon
provides proteins which are required
under more extreme conditions [22].

Another important example of the
regulons is the SOS regulon, which becomes
activated in response to extensive DNA
damage. Weigle et al. were the first to
demonstrate the induction of DNA repair
genes, in the case of the reactivation of
ultraviolet light-irradiated lambda (λ) phage.
Similar to the heat shock regulon, the SOS
regulon also has a mechanism for signaling
the “on” and “off” of the regulon. In
prokaryotes, the SOS system is regulated
by two main proteins, namely LexA and
RecA. The transcription of almost 48 genes
has been shown to be regulated by the LexA
protein, a homodimer that acts as a
transcriptional repressor after binding to
a sequence near the promoter/operator
region of these genes, called the SOS
box. In E. coli, the SOS boxes are almost
20 nucleotide-long sequences with
a palindromic structure and a conserved
sequence. In other prokaryotes, however,
the sequence of the SOS boxes varies
considerably, with different lengths and
compositions. Nonetheless, in all cases
the sequence is conserved and is one of
the strongest short signals in the genome.
Those SOS promoters that are bound by
LexA are unable to initiate transcription;
upon DNA damage, however, LexA is
inactivated and removed, which results in
expression of the SOS genes. It has been
shown that, upon exposure to DNA-damaging
agents, large amounts of single-stranded
DNA accumulate, and that this
single-stranded DNA binds to the RecA
protein, which is involved in homologous
recombination and postreplication DNA
repair. When DNA damage occurs, the
RecA-bound single-stranded DNA will bind to
LexA and induce the latter to cleave itself
(autocleavage). The autocleavage of LexA
has been shown to take place between
two specific amino acids that separate
the repressor into two domains, the
DNA-binding domain and the dimerization
domain. As a result of this disruption
of dimerization, LexA is removed from the
SOS box, after which the SOS genes are
expressed at high levels. Subsequent to the
SOS response, the amount of RecA that is
complexed to single-stranded DNA will be
decreased due to DNA repair, and LexA
no longer undergoes autocleavage; this
results in a return of the regulon to the
uninduced state [22-26].
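
The palindromic structure of an SOS box can be checked computationally: a perfectly palindromic DNA site reads the same as its own reverse complement. The Python sketch below counts the positions at which a candidate site deviates from that symmetry. The 20-nucleotide example is the commonly quoted E. coli consensus form, which is an assumption brought in here and is not given in the text; the variant and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def palindrome_mismatches(seq):
    """Number of positions at which seq differs from its reverse complement.

    0 means the site is a perfect palindrome in the DNA sense.
    """
    seq = seq.upper()
    return sum(a != b for a, b in zip(seq, reverse_complement(seq)))


consensus_like = "TACTGTATATATATACAGTA"  # assumed consensus-style SOS box
variant = "TACTGTATATATATACAGTT"         # hypothetical single-base variant

perfect = palindrome_mismatches(consensus_like)
degraded = palindrome_mismatches(variant)
```

A single base change away from the symmetrical sequence shows up as two mismatched positions in this count, because the altered base disagrees with its partner on both arms of the palindrome.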

During the SOS response, cell division 
is also halted, so that any damaged 
chromosomes do not become segregated 
into the daughter cells. Consequently, 
during the SOS response, in addition to 
the DNA-repair enzymes a cell division- 
inhibitory protein is also expressed, at high 
levels. 


3 
Regulation of Gene Expression in 
Eukaryotes 


In comparison to prokaryotes, the eukary- 
otes have a much more complex regulatory 
mechanism of transcription, with RNA 
splicing also playing an important role in 
the regulation of gene expression. In addi- 
tion to the activation of gene structure, the 
polyadenylation, capping, transport to the 
cytoplasm, and translation of mRNA rep- 
resent potent control points in the process 
of regulating gene expression. Five poten- 
tial control points for regulating gene ex- 
pression in eukaryotes are shown in Fig. 6. 
The most important method of control is 
to regulate the initiation of transcription 
(i.e., the interaction of RNA polymerase
with the promoter region), which may be
demonstrated using a technique known as
run-off transcription. In this case, the nuclei
are first isolated from the cells and then 
incubated with radiolabeled nucleoside 
triphosphates. Under suitable conditions, 
unfinished transcripts will be completed, 
but no new transcripts will be synthesized; 
consequently, the RNA that is labeled by 
using this method will have been derived 
from those genes that started transcription 
at the time the nuclei were isolated. Sub- 
sequently, when the labeled RNA is used 
to probe DNA from a clone of genes under 
investigation, an absence of hybridization 
between the labeled RNA and the cloned 
DNA indicates that the DNA was not tran- 
scribed in the tissue. The use of this 
technique to examine several genes has 
led to the realization that an absence of 
gene expression does, indeed, result from 
an absence of transcription [27]. 

Nowadays, DNA microarray technology,
which is more commonly known as “gene
chip technology,” is also used widely to
identify the presence of complementary
sequences of DNA. In fact, microarray
technology can be regarded very much
as a modern-day “genetic revolution,” and
comparable to the development of
microprocessors in the computer revolution
of 30 years ago. Today, with the advent of
microarray technology, the task of
screening genetic information has become an
automatic routine that exploits the
tendency for a molecule that is carrying
a template for synthesizing mRNA and
protein to bind to the very DNA that
produces it. Currently, microarrays
incorporate many thousands of probes, each
of which is imbued with a different nucleic
acid from known and unknown genes, to
bind with mRNA. Subsequently, the
resulting bonded molecules will fluoresce
under different colors of laser light, thus
demonstrating which complementary
sequence is present. In this way, these
microarrays can be used to measure the
incidence of genes and their expression.

Fig. 6 Regulation of gene expression.
Gene regulation may take place in a
gene-specific manner at any of the several
sequential steps. However, there are five
potential control steps.

More recently, following the determina- 
tion of the human genome sequence, the 
importance of the single nucleotide poly- 
morphism (SNP) has also been realized. 
The SNPs represent minor variations in 
DNA that define the differences that oc- 
cur among people, that may predispose 
a person to disease, and that may in- 
fluence a patient’s response to a drug. 
Consequently, with the genetic make-up 
of humankind now broadly known, it is 
possible to create microarrays that are 
capable of targeting individual SNP varia- 
tions, and thereby to make much greater 
comparisons across the genome. Taken 
together, the results of these studies may 
help to identify the roots of many dis- 
eases, especially when combined with 
specific software that has been developed 
to design microarrays incorporating very 
large numbers of probes [28]. 


The real-time polymerase chain reaction
(RT-PCR) has also been used to quantify 
the level of gene expression. This tech- 
nology, which is both highly sensitive and 
convenient, includes approaches that serve 
as a natural complement of transcriptome 
analysis, either when the tuning of ar- 
ray results is necessary or when an array 
sensitivity limit is reached for low-level 
transcripts of interest. In RT-PCR, the 
sensitive quantification of PCR products 
relies on the detection of a fluorescent 
signal that is proportional to the amount 
of product. Typically, PCR products can 
be measured in real time by using a dye 
that will bind with double-stranded, but 
not single-stranded, DNA, or with labeled 
oligonucleotides that can bind specifically 
to the PCR products. 
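
Because the fluorescent signal is proportional to the amount of product, and the product roughly doubles in each cycle, a difference in the cycle at which samples cross a fixed fluorescence threshold (Ct) can be converted into a fold difference in starting template. The Python sketch below shows the widely used comparative (delta-delta Ct) calculation. It assumes perfect doubling per cycle and normalization to a reference gene, neither of which is specified in the text, and the Ct values are hypothetical.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control,
                        efficiency=2.0):
    """Fold change of a target gene in a sample relative to a control.

    Each Ct is the cycle at which fluorescence crosses the threshold.
    The target gene is normalized to a reference gene in both conditions.
    efficiency = 2.0 assumes the product doubles every cycle.
    """
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    return efficiency ** -(delta_sample - delta_control)


# Hypothetical Ct values: the target crosses the threshold 3 cycles
# earlier in the sample than in the control, reference gene unchanged.
fold = relative_expression(22.0, 18.0, 25.0, 18.0)
print(f"fold change: {fold:.1f}")
```

A three-cycle advantage corresponds to an eightfold higher starting amount under the perfect-doubling assumption; if the amplification efficiency is lower, the `efficiency` argument can be reduced accordingly.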

The cells of multicellular organisms are 
genetically homogeneous but structurally 
and functionally heterogeneous, because 
of the differential expression of genes. 
Many of these differences in gene ex- 
pression occur during development, and 
are retained through mitosis. Stable alter- 
ations of this type are referred to as being 
epigenetic, because they are heritable in the 
short term but do not involve mutations 
of the DNA itself. The term “epigenet- 
ics” is used to define the mechanism by 
which changes in the pattern of inherited 
gene expression occur in the absence of 
alterations or changes in the nucleotide 
composition of a given gene. In the past, 
research investigations have been focused 
on two molecular mechanisms that medi- 
ate epigenetic phenomena, namely DNA 
methylation and histone modifications. 
Previously, it has been shown that epi- 
genetic effects via DNA methylation have 
an important role in development, but can 
also arise stochastically as animals age. 
The identification of proteins that mediate 
these effects has been helpful in elucidating
the epigenetic effect which, when 
perturbed, may result in disease. Typically, 
external factors that apply to epigenetic 
processes are associated with the diet in 
long-term diseases, such as cancer. In- 
deed, it has been proposed that epigenetic 
mechanisms might allow an organism 
to respond to the environment through 
changes in gene expression [29, 30]. 

The fact that many genes are transcribed 
in one tissue or organ, but not in others, 
may explain the need for cell differenti- 
ation in eukaryotic organisms, whereby 
some genes are expressed under the in- 
fluence of certain signaling agents, such 
as the substrates of specific enzymes, hor- 
mones, and regulatory nucleotides. Gene 
expression under the influence of certain 
signaling agents has also been considered 
as the phenomenon of induction, which is 
less prominent among eukaryotes than in 
prokaryotes. Typically, in eukaryotes more 
time is required for induction, and the 
extent of stimulation may be only 10- to 
20-fold; this contrasts greatly with bacte- 
ria, where many thousand-fold levels of 
stimulation may occur as a result of induc- 
tion. Since, in eukaryotes, monocistronic 
mRNAs are generally found, compared 
to polycistronic RNAs in prokaryotes, 
coordinate induction has not been re- 
ported in eukaryotes. Many years ago, the 
so-called “Britten-Davidson model” was
proposed to explain the induction
phenomena in eukaryotic genes, according to
which the eukaryotic genome contains a
large number of sensor sites that recognize
specific molecular signaling agents such
as hormones and the substrates of specific
enzymes. Each sensor site is adjacent to
an integrator gene such that, when a
sensor site is activated following the
binding of a signaling agent, the integrator
gene is transcribed to form its
complementary RNA, termed an activator RNA.
The latter, in turn, is recognized by one or
more receptor sites that are located
elsewhere in the genome and may be on the
same or another chromosome. It is when the
activator RNA binds to the receptor site that
an adjacent structural gene is transcribed
[1, 31, 32] (Fig. 7).

Fig. 7 The Britten-Davidson model of gene
regulation. When a signaling agent such as
a hormone binds to a sensor site,
transcription of the integrator gene occurs,
such that complementary RNA (the activator
RNA) is formed. The activator RNA is
recognized by the receptor site located
elsewhere in the genome. When the activator
RNA binds to the receptor site, the adjacent
structural gene is transcribed and
subsequently translated.

Although, initially, the phenomenon of
coordinate expression in eukaryotes was
considered nonexistent owing to the
presence of monocistronic mRNAs, the use
of DNA microarrays and global expression
analysis has illustrated the highly
coordinate expression of genes that
function in common processes in eukaryotes.
This process, which has been termed
“synexpression” in eukaryotes, has been
considered comparable to the role of operons
in prokaryotes. Moreover, it has also been
proposed as a key determinant facilitating
evolutionary change leading to animal
diversity. By using DNA microarrays, the
simultaneous monitoring of thousands of
transcripts is possible, and this has in turn
provided global insights into gene
expression. Ultimately, however, the
expression data have revealed a high degree
of order in the genetic program, and a tight
coordination of the expression of groups of
genes that function in a common process [33].


It was while working with mammalian 
cells infected with SV40 virus that Fren- 
ster first suggested the existence of a 
de-repressor model for gene regulation in 
eukaryotes. Based on experimental data 
indicating the ability of exogenous DNA 
or RNA to de-repress specific loci on the 
host cellular genome, this model sug- 
gested a close relationship to the normal 
mechanisms of gene regulation in animal 
cells, which may be subverted to allow 
the re-expression of otherwise repressed 
embryonic information. This derepres- 
sion model accounted for a selective gene 
transcription that was locus- and strand- 
specific, but failed to discuss gene—gene 
interaction. 

Subsequently, Frenster proposed 
a Mated Model of gene regulation 
in eukaryotes, according to which a 
derepressor RNA (dRNA) binds to the 
anticoding strand of an operator locus, 
thus permitting the transcription of 
operator and structural gene loci. The 
dRNA of an operon is complementary in 
base sequence to its operator portion of 
the direct transcription product. Following 
this, the direct transcription product 
would be split into mRNA and operator 
RNA (oRNA), with cleavage occurring 
either directly or after the formation
of a heterometric duplex RNA by the
base-pairing of dRNA with the operator
portion. Thereafter, the dRNA would be
removed selectively from the operator
locus, providing a feedback inhibition of
transcription of the operon. Following the
consumption of mRNA and degradation
of the oRNA, the dRNA would be released
from the duplex, providing a positive
feedback derepression of transcription of
the operon. As the different structural
genes may share operators with common
base sequences, they would be equally
sensitive to a given species of dRNA, during
both transcription of the gene and its
selective inhibition [34] (Fig. 8).

In eukaryotes, cell division is normally 
highly regulated, aided by growth factors 
that cause the cells to undergo cell division 
and, in some cases, cell differentiation. 
Among these growth factors, some are 
specific for certain types of cell, due to 
specific receptors present at the cell sur- 
face, while others are general rather than 
specific in their effects. Other growth
factors which control cell division include 
epidermal growth factor, nerve growth 
factor, platelet-derived growth factor, fi- 
broblast growth factor, and lymphokines. 
Occasionally, the failure of growth factors 
to control cell division may lead to the 
creation of tumors. 

Although, in prokaryotes, the negative 
control of transcription plays an important 
role (e.g., fail-safe mechanism), a positive 
control in eukaryotes is even more im- 
portant, for the simple reason that, in a 
large genome, such an approach is more 
efficient. If a large number of genes is 
to be negatively controlled, then each cell 
would need to synthesize the same num- 
ber of different repressors in sufficient 
amounts as to permit the specific binding 
of each. In addition, the nonspecific DNA- 
binding of regulatory proteins (repressors) 
is especially important in much larger 
genomes of the higher eukaryotes, as the
chance of a specific-binding sequence being
present at random at inappropriate
sites would also increase with genome size
[35].

Fig. 8 Mated model for gene regulation. o,
operator; sg, structural gene; dRNA, derepressor
RNA. The derepressor RNA (dRNA) binds
to the anticoding strand of the operator locus,
permitting transcription of the operator and
structural gene loci. The transcription product
is then split into mRNA and operator RNA
(oRNA). Cleavage may occur either directly,
or after the formation of heterometric duplex
RNA.
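
The genome-size argument made in the text (chance matches to a specific
binding sequence become more likely as the genome grows) can be made concrete
with a rough calculation. The sketch below assumes, purely for illustration,
a random sequence with equal base frequencies, a hypothetical 12 bp
recognition site, and approximate genome sizes (about 4.6 Mb for E. coli and
about 3.2 Gb for the human haploid genome). The function name is an
assumption and does not come from the original text.

```python
def expected_random_sites(genome_bp: int, site_bp: int) -> float:
    """Expected number of chance matches to one specific site on one strand.

    Assumes every base is independent and equally likely, which is a
    deliberate simplification of real genomes.
    """
    positions = genome_bp - site_bp + 1
    return positions / 4 ** site_bp


SITE_BP = 12  # hypothetical recognition-site length
GENOMES = {
    "E. coli (approx.)": 4_600_000,
    "human haploid (approx.)": 3_200_000_000,
}

for name, size in GENOMES.items():
    count = expected_random_sites(size, SITE_BP)
    print(f"{name}: {count:.2f} expected chance sites")
```

Under these assumptions the expected number of chance sites scales almost
linearly with genome size, which is the point of the argument.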


3.1 
Transcriptionally Active Chromatin 


The major part of the eukaryotic genome 
is sequestered in the nucleus, where it 
is surrounded by a nuclear membrane to 
safeguard it against exposure to the cyto- 
plasm. As the transcription of genes also 
occurs within the nucleus, but translation 
occurs mainly in the cytoplasm, the two 
processes cannot be coupled. Typically, 
the chromosomes of eukaryotes are more 
complex than those of bacteria, with each 
containing a double-helix DNA molecule 
that may be more than 20-fold larger than 
that of a bacterial chromosome. In eu- 
karyotes, the DNA is tightly complexed 
with histone proteins that are thought to 
have structural and protective functions; 
other loosely bound nonhistone proteins 
are also generally present, albeit in smaller 
quantities than the histones. Although the 
functions of the nonhistone proteins are 
not clear, they may have role(s) in tran- 
scription and/or replication. Among other 
DNAs present in the cells, mitochondrial 
and chloroplast DNAs are both small, 
double-stranded molecules. Typically, the 
mitochondrial DNA of plants is larger than 
that of animals, while all plants appear 
to have similarly sized chloroplast DNAs. 
The mitochondrial and chloroplast DNAs 
resemble bacterial DNA but, unlike eu- 
karyotic nuclear DNA, are not associated 
with histones [36]. 

It has been shown that those chromoso- 
mal regions which have been activated for 
transcription are more sensitive to DNase 
degradation, which is indicative of their 
lesser degree of protection by histones. 


The actively transcribed regions have also 
been found to include sequences with a 
high sensitivity to DNase, termed hyper- 
sensitive sites. The latter are generally no 
longer than 200 bp, and are found within 
the 1000 bp that flank the 5’ ends of the 
transcribed genes. In some cases, these 
hypersensitive sites may be located farther 
from the 5’ end, close to the 3’ end, or 
even within the gene itself. Many hyper- 
sensitive sites have been found to serve as 
binding sites for the regulatory proteins 
[37]. 

The telomeres are specialized structures
which are located at the ends of linear
eukaryotic chromosomes, and which generally
have many tandem copies of a short
oligonucleotide sequence (TaGb) in one
strand, with CaAb in the complementary
strand (where the subscripts a and b are
1 to 4). The structure of the telomere poses
a biological problem, however, in that DNA
replication requires a primer, but in linear
DNA molecules it is impossible to synthesize
an RNA primer starting at the end
nucleotide. This problem is resolved
by telomerase, an enzyme
which resembles reverse transcriptase and
catalyzes the addition of telomeric repeats
to the chromosome ends. Within its structure,
telomerase has both protein and RNA
regions; the RNA portion is about 150 nucleotides
in length and has about 1.5 copies
of the CaAb telomere repeat that serves as
a template for the synthesis of the TaGb
strand. The telomerase-like reverse transcriptase
synthesizes only a segment of
DNA that is complementary to an internal
RNA template [38].
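
The templating step described above can be sketched in a few lines. The
example uses the template region of Tetrahymena telomerase RNA
(5'-CAACCCCAA-3', i.e., about 1.5 copies of the CCCCAA repeat) as an
illustration; this particular organism and the function name are assumptions
added here and are not taken from the text.

```python
# Base pairing used when DNA is copied from an RNA template.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A"}


def dna_copy_of_template(rna_template_5to3: str) -> str:
    """Return the DNA strand (written 5'->3') copied from an RNA template."""
    return "".join(
        RNA_TO_DNA_COMPLEMENT[base] for base in reversed(rna_template_5to3)
    )


# Template region of Tetrahymena telomerase RNA (illustrative example).
TEMPLATE = "CAACCCCAA"
product = dna_copy_of_template(TEMPLATE)
print(product)  # one full TTGGGG repeat plus the start of the next
```

Because the template holds about 1.5 repeats, the newly made DNA end can
re-pair with the template and be extended again, one repeat at a time.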

The DNA in transcriptionally active 
chromatin has been found to be methy- 
lated to a lesser degree; moreover, nu- 
cleosomes have not been found in the 
transcriptionally active regions (at least in 


some cases). Chromatin has been clas- 
sified into two groups: heterochromatin, 
a highly condensed chromatin which is 
transcriptionally inert; and euchromatin, a 
loosely packed chromatin which is tran- 
scriptionally active. 


3.2 
Regulation of Gene Expression at the 
Initiation of Transcription 


Although the regulation of gene expres- 
sion at the initiation stage of transcription 
(i.e., the binding of RNA polymerase to the 
promoter) has been demonstrated, there is 
at present no evidence for the control of 
gene expression at the subsequent stages 
of transcription. Three RNA polymerases 
have been identified in eukaryotic cells 
as being involved in the biosynthesis of 
different classes of RNAs. For example,
the biosynthesis of heterogeneous nuclear RNA
(HnRNA) is carried out by RNA
polymerase II. The initiation of transcription
by RNA polymerase II is regulated
by a series of DNA elements that may
be divided into the core promoter elements,
consisting of the TATA box and the
transcription initiation site, and upstream
activating sequences (UASs). The UASs are
generally located upstream of the core promoter
sequence, although in some cases they
have been found downstream of the transcription
start site (Fig. 9). A
specific protein is bound to each UAS, and
this results in a positive or negative effect
on the core promoter activity. The TATA
box is generally found 25 bp upstream
from the transcription initiation site; it is
common in eukaryotic genes, and
very few genes have been shown to be expressed
without the TATA box. In addition
to the TATA box, another sequence, referred
to as the CAAT box, has been found
at the -75 position from the initiation site.


The CAAT box (which has the consensus 
sequence GGCAATCT) functions in either 
orientation, and plays an important role in 
increasing the promoter strength. A GC 
box at the -90 position from the initiation
site has also been found; this may occur in 
either orientation, and is a common com- 
ponent of the promoters of housekeeping 
genes. The GC box has the consensus 
sequence GGGCCGGG. 
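
The promoter elements described above lend themselves to a simple motif
scan. The following is a minimal sketch, assuming exact matches to the
consensus sequences quoted in the text and a hypothetical 100 bp upstream
region built only for this example; real promoters show considerable
variation around these consensus sequences, so this is not a practical
promoter finder.

```python
import re

# Consensus motifs exactly as quoted in the text.
ELEMENTS = {"TATA box": "TATA", "CAAT box": "GGCAATCT", "GC box": "GGGCCGGG"}


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_elements(upstream: str):
    """Scan DNA that ends just before the start site (+1).

    Returns (element, position relative to the start site, strand) tuples;
    strand "-" means the motif occurs in the reverse orientation.
    """
    hits = []
    for name, motif in ELEMENTS.items():
        patterns = {motif: "+"}
        # Palindromic motifs (e.g., TATA) are only reported once.
        patterns.setdefault(reverse_complement(motif), "-")
        for pattern, strand in patterns.items():
            for match in re.finditer(f"(?={pattern})", upstream):
                hits.append((name, match.start() - len(upstream), strand))
    return sorted(hits, key=lambda hit: hit[1])


def toy_promoter() -> str:
    """Hypothetical 100 bp upstream region with the three elements planted."""
    seq = ["A"] * 100
    planted = (
        ("GGGCCGGG", -90),
        (reverse_complement("GGCAATCT"), -75),  # CAAT box in reverse orientation
        ("TATA", -25),
    )
    for motif, position in planted:
        start = len(seq) + position
        seq[start:start + len(motif)] = motif
    return "".join(seq)


for hit in find_elements(toy_promoter()):
    print(hit)
```

Scanning both orientations reflects the statement that the CAAT and GC boxes
function in either orientation.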

Transcription factor II D (TFIID) for 
RNA polymerase II has been shown to 
play an important role in the initiation of 
transcription, by binding to the TATA box 
sequences. Such binding of TFIID facili- 
tates the binding of the RNA polymerase 
II on to the promoter. The assembly of an 
initiation complex and RNA polymerase 
at the promoter is a complex process that 
requires the participation of many other 
initiation factors. Typically, TFIID has two 
components — the TATA box binding pro- 
tein (TBP) and another protein termed the 
TBP-associated factor (TAF). Whilst TAF 
is important for regulating the transcrip- 
tion, TBP — which is also referred to as 
the “commitment factor” — binds to DNA 
in the minor groove. The inner surface of 
the TBP binds to DNA, while the outer 
surface is available to extend contacts to 
other proteins. The DNA-binding sites of 
TBP consist of sequences that are con- 
served between species, with the variable 
N-terminal tail being exposed to interact 
with other proteins. Normally, TBP is the 
only transcription factor to make contacts 
with the specific sequences in the DNA. 
The activity of TFIID has also been shown 
to be regulated by inhibitory proteins that 
interact with TBP; these inhibitory pro- 
teins may serve an important regulatory 
role by maintaining any genes that have 
been removed from inactive chromatin in 
a repressed, but rapidly inducible, state 
[39-41].

Fig. 9 Regulation of eukaryotic gene expression
at the transcriptional level. Eukaryotic
control is often positive: trans-acting
factors bind to cis-acting sites in order for
RNA polymerase to initiate transcription at
the promoter. (a) Transcription is turned
off by default (if the correct initiation factors
are not bound in the regulatory region);
(b) Transcription becomes turned on after the
binding of initiation factors on the regulatory
region, after which RNA polymerase binds to
the promoter region.

A number of other factors also play
important roles in the regulation of initiation
of transcription. For example, TFIIA,
which joins the initiation complex after
TFIID, has two subunits in yeast and
three in mammals. Following the joining
of TFIIA, the TFIID is able to protect a region
that extends further upstream, while
the addition of TFIIB provides further protection
to the region of the template in
the vicinity of the start point, from -1
to +10 bp. TFIIF is a dimeric protein,
in which the larger subunit has an ATP-dependent
DNA helicase activity that may
be involved in opening the DNA during
initiation, while the smaller subunit has
been found to be equivalent to the sigma
factor of E. coli. The TFIIF brings RNA
polymerase II to the assembling transcription
complex, and also provides the means
for its binding; interaction with TFIIB may
be important when TFIIF joins the complex.
Polymerase binding extends the sites
that are protected downstream to +15
on the template strand and +20 on the
nontemplate strand. TFIIE binds at the
upstream boundary. Two more factors,
TFIIH and TFIIJ, also join the complex
after TFIIE; TFIIH has kinase activity that
may phosphorylate the carboxy-terminal domain
(CTD) tail of RNA polymerase II, which
consists of multiple
repeats of the consensus sequence,
Tyr-Ser-Pro-Thr-Ser-Pro-Ser, that is unique to
RNA polymerase II. Phosphorylation of
the tail (at either seryl or threonyl residues)
is required to release the enzyme from the
transcription factors, so that it can leave
the promoter region to start elongation. The
TATA box determines the location of the
start point, while the general initiation
process of transcription is the same as
in bacteria. RNA polymerase first
forms a closed complex, which is subsequently
converted into an open complex
in which the DNA strands are separated.
The removal of TFIIE occurs during the
process of open complex formation.
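
The stepwise assembly described above can be summarized as an ordered
dependency list. The sketch below is only a bookkeeping aid: it assumes a
strictly linear order (the text does not state the relative order of TFIIH
and TFIIJ, so their order here is an arbitrary assumption), and the function
names are hypothetical.

```python
# Order of joining as described in the text; TFIIH before TFIIJ is assumed.
ASSEMBLY_ORDER = [
    "TFIID",
    "TFIIA",
    "TFIIB",
    "TFIIF + RNA polymerase II",
    "TFIIE",
    "TFIIH",
    "TFIIJ",
]


def can_join(factor: str, bound: list) -> bool:
    """A factor may join only when all earlier factors are already bound."""
    index = ASSEMBLY_ORDER.index(factor)
    return bound == ASSEMBLY_ORDER[:index]


def assemble(factors: list) -> list:
    """Add factors one by one, rejecting any out-of-order addition."""
    bound = []
    for factor in factors:
        if not can_join(factor, bound):
            raise ValueError(f"{factor} cannot join after {bound}")
        bound.append(factor)
    return bound


complex_members = assemble(ASSEMBLY_ORDER)
print(" -> ".join(complex_members))
```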

The CAAT box is recognized by the pro- 
teins of the CCAAT-binding transcription 
factor (CTF) family, which are generated 
by the alternative splicing from a single 
gene. The CAAT box-binding protein 1 


(CP1) factor binds to the CAAT boxes
of α-globin, while CP2 binds the CAAT
box in the γ-fibrinogen gene. Other proteins
also bind to the CAAT boxes; for
example, the albumin CCAAT factor (ACF) 
protein binds CAAT in the albumin pro- 
moter. The CAAT box may also serve as 
a regulatory point; in embryonic tissues, a 
protein referred to as the CAAT displace- 
ment protein (CDP) binds to the CAAT 
boxes, preventing the transcription factors 
from recognizing them. In the testes, the 
promoter is bound by transcription fac- 
tors at the TATA box, CAAT box, and the 
octamer sequences. In embryonic tissues, 
the exclusion of a CAAT binding factor 
from the promoter prevents a transcrip- 
tion complex from being assembled. This 
behavior is analogous to the effect of a 
bacterial repressor. 

The GC box is recognized by the factor 
SP1, a monomeric protein which makes 
contacts on one strand of the DNA over 
a ~20 bp binding site, including the GC
box. In the SV40 promoter, the multiple
boxes between -70 and -110 all bind
this factor, thus protecting the whole 
region. However, in the thymidine kinase 
promoter, SP1 interacts with a factor at the 
CAAT box on one side, and with TFIID 
bound at the TATA box on the other side. 

Additional regulatory sequence ele- 
ments are enhancers in higher eukaryotes, 
and UASs in yeast. In the case of the 
enhancer, the location and orientation of 
sequences relative to the transcription start 
site are relatively unimportant. Typically, 
the enhancers exert their regulatory ef- 
fects even when moved experimentally, 
and they may occur naturally thousands 
of base pairs away from the gene which 
is being regulated. The enhancers have no 
promoter activity of their own, but may 
stimulate transcription over considerable 
distances. Moreover, the enhancers may
be involved in the regulation of gene expression
during the development of the
organism, such as the immunoglobulin 
enhancer that only functions in B lympho- 
cytes. 

A regulatory role of enhancer activity 
has been identified in the transcription of 
genes that are responsive to steroid hor- 
mones. In this case, the steroid is bound 
onto a soluble protein, which in turn binds 
the enhancers for the steroid-responsive 
genes. Transcriptional activation is also 
accompanied by a decondensation of the 
chromatin in the regions containing the 
genes; this is evident from the fact that 
the region becomes more sensitive to 
DNase I digestion and subsequent bind- 
ing of the transcriptional factors to the 
promoter regions. An enhancer may also 
provide an entry site, a point at which RNA 
polymerase and/or other essential protein 
associates with chromatin. This involves 
the same type of interaction with the basal 
apparatus as the interactions promoted by 
the upstream promoter elements. 

The UASs found in yeast are analogous 
to the enhancers of higher eukaryotes, 
and are located upstream of the gene in 
a region having two identical sequences
of 72 bp, repeated in tandem 200
bp upstream of the initiation start site.
The 72 bp repeat is located within a
hypersensitive site of chromatin. 

RNA polymerase I transcribes the genes
for ribosomal RNA from a single type
of promoter. The promoters for RNA
polymerase I have the least diversity in
the eukaryotic genome. The promoter
consists of a bipartite sequence in the
region preceding the start site. The core
promoter surrounds the start site, extending
from -45 to +20 bp, and is able to start
transcription. It has been found located
70 bp downstream of a control element called
the upstream control element (UCE); the UCE,
located at -180 to -107 bp, increases the
efficiency of the promoter. Both regions are
enriched in terms of GC content, and two
initiation factors are required for the
initiation of transcription by RNA polymerase I.
The upstream binding factor 1 (UBF1) binds
in a sequence-specific manner to related
sequences in the core promoter and UCE.
Another factor, termed selectivity factor
1 (SL1), binds cooperatively with UBF1
to extend the region of DNA that is covered.
Following the binding of both factors, RNA
polymerase I is able to bind to the promoter
region and initiate transcription.
Notably, SL1 is species-specific;
for example, mouse SL1 cannot function
on human DNA. SL1, which consists
of four proteins (including one known as
TBP, which is also required for the initiation
of transcription by RNA polymerases II and
III), has been considered analogous to the
sigma factor of bacteria.

RNA polymerase III transcribes the 
DNA coding for 5S rRNA, tRNA, and many 
small nuclear RNAs (snRNAs). The DNAs 
transcribed by RNA polymerase III are all 
smaller in size, generally less than 300 
nucleotides. Studies of the regulation of 
oocyte 5S rRNA synthesis have shown that 
it requires three transcription factors for 
initiation, known as TF IIIA, TF IIIB, and 
TF IIIC. TF IIIA is a member of the zinc
finger proteins, whereas TF IIIB consists
of TBP and two other proteins, and TF
IIIC is still used as a partially purified
preparation containing five subunits. TF 
IIIA has been shown to be specific for 
oocyte 5S rRNA, whereas TF IIIB and TF 
IIIC are required for the transcription of 
all DNAs by RNA polymerase III. 

Promoters recognized by RNA poly- 
merase III are of two types, lying upstream 
and downstream of the initiation site, 
and are recognized by different initiation 


factors. Typically, the promoters for 5S 
rRNA and tRNA genes lie downstream of 
the start site, whereas promoters for the 
snRNA gene are located upstream of the 
start point. The promoters for the 5S rRNA 
gene are located between +55 and +80 bp
within the gene. 

The promoters for RNA polymerase III 
have a bipartite structure, with the two 
short sequences being separated by a 
variable sequence. The type I promoter 
consists of box A sequence separated 
from a box C sequence, while the type 
II promoter consists of a box A sequence 
separated from a box B sequence. In type 
II promoters, TF IIIC recognizes box B
but binds to a region involving box A, as
well as box B. In the type I promoters,
TF IIIA binds on box C. In promoters
of both types I and II, the binding of
TF IIIC facilitates the binding of TF IIIB
to the sequence surrounding the start site.
Recently, TF IIIB has been shown to be the
main initiation factor for RNA polymerase
III, whereas both TF IIIA and TF IIIC
help TF IIIB to bind at the correct site.
The efficiency of transcription by RNA
polymerase III is found to be increased by
the presence of the proximal sequence
element (PSE). All the transcription factors
for RNA polymerase III bind at the promoter
region, forming a preinitiation complex
before the binding of RNA polymerase III
onto the promoter [42-44].
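
The two internal promoter types described above can be captured in a small
lookup structure. This is a minimal sketch: the dictionary layout and
function name are hypothetical, and the assumed binding order (TF IIIA
before TF IIIC at type I promoters, with TF IIIB recruited last) is one
reading of the text rather than an established protocol.

```python
# Internal RNA polymerase III promoters as described in the text.
POL3_PROMOTERS = {
    "type I": {
        "boxes": ("A", "C"),
        "factors": ["TF IIIA", "TF IIIC", "TF IIIB"],  # assumed binding order
    },
    "type II": {
        "boxes": ("A", "B"),
        "factors": ["TF IIIC", "TF IIIB"],
    },
}


def classify_promoter(boxes):
    """Return the promoter type whose boxes match the ones observed."""
    observed = tuple(sorted(boxes))
    for name, info in POL3_PROMOTERS.items():
        if tuple(sorted(info["boxes"])) == observed:
            return name
    raise ValueError(f"no internal promoter type with boxes {boxes}")


for boxes in (("A", "C"), ("B", "A")):
    kind = classify_promoter(boxes)
    order = " -> ".join(POL3_PROMOTERS[kind]["factors"] + ["RNA polymerase III"])
    print(f"boxes {boxes}: {kind}; {order}")
```

In both types TF IIIB is recruited last among the factors, consistent with
its role as the main initiation factor.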


3.3 
Regulation of Gene Expression in 
Chloroplasts 


In contrast to the nuclear genome, chloro- 
plasts have their own genetic system which 
has certain prokaryotic as well as eu- 
karyotic features. Many chloroplast genes 
are also organized as operons and, in 
contrast to nuclear transcription (where 


monocistronic RNA is transcribed), poly- 
cistronic RNA formation occurs in the 
chloroplasts. In addition, chloroplast gene 
expression more closely resembles the 
prokaryotic systems, as it has σ70-type
promoters. The plastid operons are transcribed
as polycistronic units by at least
two distinct RNA polymerases: a plastid-encoded
RNA polymerase (PERP), and
the nuclear-encoded RNA polymerase
(NERP). The PERP resembles the bacterial
RNA polymerase, and consists of four
different subunits, α, β, β′, and β″, which
are encoded on the plastid genome by the
rpoA, rpoB, rpoC1, and rpoC2 genes, respectively.
The activity of the PERP core
enzyme is regulated by sigma-like tran- 
scription factors that play a role in pro- 
moter selection in a similar manner to 
the RNA polymerase from E. coli. These 
primary transcripts are processed into 
smaller RNAs, which are further modified 
to generate functional RNAs. Although, in 
general, the RNA-processing mechanisms 
are unknown, they represent an important 
step in the control of chloroplast gene ex- 
pression. Such mechanisms include RNA 
cleavage, stabilization, intron splicing, and 
RNA editing. Some nuclear-encoded pro- 
teins that participate in diverse plastid 
RNA processing have been characterized, 
and most of these appear to belong to 
the pentatricopeptide repeat (PPR) pro- 
tein family that is implicated in many 
crucial functions, including organelle bio- 
genesis and plant development. The PPR 
proteins seem to bind to specific chloro- 
plast transcripts, thus modulating their 
expression with other general factors, and 
also appear to be involved in the control 
of post-transcriptional gene expression in 
chloroplasts, including transcript process- 
ing, stabilization, editing, and translation. 
Further work is required to identify and study
the interacting enzymes in order to understand the role
of the PPR proteins in post-transcriptional
activities, such as splicing, stabilization,
editing, and the translation of diverse transcripts
in chloroplasts [45-47].

In the case of translation, chloroplasts
have 70S ribosomes much like prokaryotes,
and have also been shown to possess
Shine-Dalgarno-like sequences. In contrast,
chloroplast genes also have some characteristics
of nuclear systems, including the
presence of introns and highly stable mRNAs.
Generally, however, the transcription
rates and steady-state mRNA levels do
not correspond, which suggests that
post-transcriptional RNA processing and
stabilization are decisive steps in the control of
gene expression in chloroplasts.
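
Why transcription rate and steady-state mRNA level need not agree can be
illustrated with a simple first-order decay model, dM/dt = k_tx - k_deg * M,
whose steady state is M = k_tx / k_deg. This model, the function name, and
the numbers used below are illustrative assumptions and are not taken from
the text or from chloroplast measurements.

```python
import math


def steady_state_mrna(transcription_rate: float, half_life: float) -> float:
    """Steady-state mRNA level for constant synthesis and first-order decay.

    transcription_rate: transcripts made per unit time (k_tx)
    half_life: mRNA half-life in the same time unit
    """
    k_deg = math.log(2) / half_life
    return transcription_rate / k_deg


# Hypothetical genes: A is transcribed five times faster but decays quickly.
gene_a = steady_state_mrna(transcription_rate=10.0, half_life=5.0)
gene_b = steady_state_mrna(transcription_rate=2.0, half_life=50.0)
print(f"gene A: {gene_a:.1f} transcripts, gene B: {gene_b:.1f} transcripts")
```

In this toy case the more weakly transcribed gene accumulates more mRNA
simply because its transcript is more stable.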


3.4 
Regulation of Gene Expression in 
Mitochondria 


In similar fashion to chloroplasts, the 
mitochondria in eukaryotes possess an in- 
dependent genetic system. On average, a 
mitochondrion will include at least one ri- 
bosomal protein gene, together with the 
rRNA and tRNA required for the mi- 
tochondrial translation system. In plant 
mitochondria, genes may be present that 
are responsible for coding the proteins 
involved in the electron transport and 
ATPase complexes. The size of the mitochondrial
genome has been reported to
be larger, although only 30 encoded proteins
have been identified using polyacrylamide
gel electrophoresis.

Mitochondrial DNA is unusual, as it is 
neither wholly prokaryotic nor eukaryotic 
in nature. Rather, some similarities to bac- 
terial protein synthesis have been found, 
such as a sensitivity to antibiotics, the se- 
quence homology of rRNAs, and the use 
of N-formyl methionine for protein chain
initiation. Notably, the diversity of mitochondrial
tRNAs and their structures differ
from those in prokaryotes, and from the 
eukaryotic cytoplasm or chloroplasts. For 
example, the sizes of the mitochondrial ri- 
bosomes range from 55S to 77S, compared 
to 70S for chloroplast ribosomes and 80S 
for cytoplasmic ribosomes. The major dif- 
ference between the mitochondrial genetic 
system and all other systems, however, is 
that the mitochondria employ a slightly 
altered genetic code. Although, in general, 
the genetic code is considered to be uni- 
versal, in animal and yeast mitochondria 
UGA serves as a codon for tryptophan, 
whereas in plant mitochondria it is used as 
a stop codon. In plant mitochondria there 
also appears to be a strong bias towards 
the use of codons ending in T, whereas in 
yeast the preference is for those ending in 
A or T, and in animals for those ending 
in A or C. It is very difficult, therefore, to 
express nuclear or chloroplast genes in the 
mitochondrial system. 
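
The consequence of the altered code can be shown with a short translation
sketch. It builds the standard codon table and a variant in which only UGA
is reassigned to tryptophan, as the text describes for animal and yeast
mitochondria. Other codon reassignments that occur in some mitochondria are
deliberately ignored, and the example mRNA is hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Variant used here for animal and yeast mitochondria: UGA codes for Trp (W).
MITO_UGA_TRP_CODE = dict(STANDARD_CODE, UGA="W")


def translate(mrna: str, code: dict) -> str:
    """Translate from the first base until a stop codon ('*') or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = code[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


MRNA = "AUGUGAUGGUAA"  # hypothetical message containing an internal UGA
print("standard code:     ", translate(MRNA, STANDARD_CODE))
print("UGA = Trp variant: ", translate(MRNA, MITO_UGA_TRP_CODE))
```

Read with the standard code (as in the nucleus, or in plant mitochondria
according to the text), the message is truncated at UGA; read with the
variant code, translation continues through it.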

As in the case of chloroplast genes, mito- 
chondrial genes often produce a complex 
set of transcripts. Processing occurs at the 
ends of the tRNAs, which are inserted 
much like punctuation marks at the ends 
of the structural genes. Polyadenylation 
occurs in neither yeast nor plant mitochon- 
dria, and the transcripts do not include 
the entire genome. Comparatively little is 
known about the regulation of gene ex- 
pression in mitochondria; likewise, little 
is known about the splicing mechanism, 
except that splicing is known to depend on 
the RNA secondary structure rather than 
on specific splicing signals, which is un- 
like nuclear and chloroplast RNA splicing. 
On the other hand it is well known that, al- 
though the amount of mitochondrial DNA 
in a plant is less than 1% of the total cel- 
lular DNA, it plays an important role in 


the development and reproduction of the 
plant [48, 49]. 


4 
RNA Splicing 


Most eukaryotic genes have been found 
to include noncoding sequences (introns) 
in addition to coding sequences (exons). 
The introns, which are present in DNA 
and in the primary transcription product 
of the gene (HnRNA), are removed by 
RNA splicing before the mature mRNA is 
transported to the cytoplasm. The number 
of introns varies between the genes; 
for example, the dystrophin gene has 70 
introns, whereas the α-interferon gene has
no intron. The size of the intron also varies 
from almost 100 to 200 000 nucleotides. 
At least four types of reaction of RNA 
splicing have been identified, namely the 
splicing of nuclear introns, of group I and 
II introns, and of tRNA introns. Each reaction
brings about a change within the individual
RNA molecule, and therefore is considered 
to be a cis-acting event. In RNA splic- 
ing, only very short consensus sequences 
are required, and these are located as the 
end sequences of the intron, GT-AG. In 
yeast, a branch sequence UACUAAC is 
also required, which is a less-conserved 
sequence in mammals. The ends of the 
introns are identified by RNA-RNA base
pairing between the HnRNA and uridine- 
rich small nuclear ribonucleoprotein par- 
ticles (snRNPs). As the conserved splice 
site sequences are short, are not precisely 
conserved between introns, and occur fre- 
quently in the primary sequences of many 
HnRNAs, this allows the spliceosomes to 
combine different 5’ and 3’ splice sites 
in the HnRNA to produce several alter- 
natively spliced mRNAs from a single 
nuclear gene. Consequently, due to the
process of alternative splicing, multiple proteins
with different primary amino acid
sequences and biological activities may be
produced from a single gene. The alternatively
spliced mRNAs have been shown to
be regulated in a temporal, developmental,
or tissue-specific manner in many
cases. An alternative RNA splice site choice
has been shown to regulate the expression 
of a somatic sex-determination pathway 
in Drosophila. In this case, the sex-lethal 
and transformer 1 proteins were shown 
to control the maintenance of gender by 
regulating Drosophila gene expression at 
the level of alternative RNA splicing. 
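
How one gene can yield several mRNAs may be sketched by enumerating which
exons are joined. The example below models only one simple form of
alternative splicing (skipping of optional internal exons); the exon
sequences, the function name, and the choice of which exon is optional are
hypothetical.

```python
from itertools import combinations


def spliced_isoforms(exons, optional):
    """Enumerate mRNAs made by skipping any subset of the optional exons.

    exons: ordered list of exon sequences
    optional: indices of exons that may be skipped
    """
    isoforms = []
    for n_skipped in range(len(optional) + 1):
        for skipped in combinations(optional, n_skipped):
            kept = [exon for i, exon in enumerate(exons) if i not in skipped]
            isoforms.append("".join(kept))
    return isoforms


EXONS = ["AUGGCU", "GAAUUC", "CCGUAA"]  # hypothetical exon sequences
isoforms = spliced_isoforms(EXONS, optional=[1])
for mrna in isoforms:
    print(mrna)
```

With n optional exons this simple scheme already allows up to 2^n distinct
mRNAs from a single gene.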

RNA splicing starts with a 5’ splicing 
site, while the formation of a lariat occurs 
by joining the GU end of the intron to the A 
position of the branch sequence, via a 5’, 2’ 
linkage. Subsequently, the 3’-OH end of 
the exon attacks the 3’ splicing site in such 
a way that the ligation of exons occurs, 
releasing the intron as a lariat. Both reac- 
tions involve trans-esterification, in which 
the bonds are conserved. At several stages, 
ATP hydrolysis occurs, most likely to fuel 
the conformational changes occurring in 
the RNA and/or proteins. Lariat formation
determines the choice of the 3’ splicing
site, while nuclear splicing requires the
formation of a spliceosome, which contains 
various snRNPs and splicing factors. The 
snRNPs recognize consensus sequences, 
and also share some interacting proteins. 
Typically, the U1, U2, and U5 snRNPs 
each contain a single snRNA and several 
proteins, whereas the U4/U6 snRNP con- 
tains two snRNAs and several proteins. 
The U1 snRNP base-pairs with the 5’ splic- 
ing site, U2 base-pairs with the branch 
sequence, and U5 snRNP acts at the 5’ 
splicing site. U4 is released from the U4/U6
snRNP, after which U6 base-pairs with
U2 to create the catalytic center for splicing.
The Group I and Group II introns
perform RNA splicing as a self-catalyzed
property of RNA. In Group I introns, 
the hydroxyl group required for attack at 
the 5’ exon-intron junction is provided 
by a free guanine nucleotide, whereas in 
Group II introns the internal 2’-OH posi- 
tion serves as the source. Although these 
introns also follow the GT-AG rule, they 
form a characteristic secondary structure 
that holds the splice sites in the appropri- 
ate position. tRNA splicing in yeast has 
been shown to involve separate endonu- 
clease and ligase reactions, whereby the 
endonuclease recognizes the secondary 
structure of the pre-tRNA and cleaves both 
ends of the intron. The two halves of the 
tRNAs released by the removal of introns 
can be ligated, using the enzyme RNA 
ligase, in the presence of ATP [1, 50-52]. 


4.1 
Nuclear Splicing 


In eukaryotes, the majority of genes have 
introns. However, because of the presence 
of such introns (as noncoding sequences) 
in the gene there is much discrepancy 
in size between the nuclear genes and 
their corresponding mRNAs. The average 
size and complexity of the nuclear RNA 
(HnRNA) was found to be much greater 
than for mRNA. The HnRNP has also 
been found to be a ribonucleoprotein in 
which HnRNA is bound by proteins, such 
that it resembles beads connected by a
fiber. The “beads” are in fact globular-shaped
RNAs associated with six common
proteins, A1, A2, B1, B2, C1, and C2, which
are referred to as core proteins, with sizes
ranging from 34 to 120 kDa. The exact
structure of the HnRNP and the function
of packaging the RNA in the form of beads
are not clear.

RNA splicing and other post-transcriptional
changes occur in the nucleus,
and the substrate for these processes is
HnRNP. In this process, the transcript is
capped at the 5’ end, the introns are re- 
moved, and polyadenylation occurs at the 
3’ end; collectively, these reactions are re- 
ferred to as “RNA processing.” After pro- 
cessing, the RNA is transported through 
the nuclear pores to the cytoplasm, where 
it is available for translation. 

Currently, many types of splicing system 
have been identified (Fig. 10): 


1. Introns are removed from the nuclear 
RNAs, using the spliceosome. This re- 
action requires a large splicing system. 

2. The excision of certain introns is an 
autonomous property of the RNA itself. 
The ability of RNA to act as an enzyme 
is seen in the self-cleavage of viroid 
RNAs, and in the catalytic activity of 
RNase P. 

3. The removal of introns in yeast 
pre-tRNAs involves endonuclease 
and RNA ligase, whose dealings with 
pre-tRNA seem to resemble those 
of the RNA-processing enzymes. A 
critical feature here is the conformation 
of the pre-tRNA. 


Nuclear RNA splicing junctions are in- 
terchangeable, but are read in pairs. There 
is no extensive homology or complementarity
between the two ends (5’ GU-AG
3’) of an intron but, as noted above, the
junctions have well-conserved consensus
sequences. The highest conservation
is found only within the intron at the
presumed junctions. The 5’ and 3’ end dinucleotide
sequences define the left (or 5’)
and right (or 3’) splicing sites; these are 
also referred to as donor and acceptor sites. 
Although it has been shown that there is a 
common mechanism for nuclear HnRNA 


splicing, the consensus does not apply to
the introns of mitochondria, chloroplasts,
and pre-tRNAs.

In order to ensure splicing of the correct 
pairs of junctions, the following two points 
may be applicable: 


1. It may be an intrinsic property of the 
RNA to connect the sites at the ends 
of a particular intron, because of the 
base-pairing involving these regions. 

2. All of the 5’ sites may be function- 
ally equivalent, and all 3’ sites may be 
similarly indistinguishable. The splic- 
ing could follow rules which ensure 
that the 5’ site is always connected to 
the 3’ site that is located next in the
RNA. 
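
The second possibility listed above (each 5’ site is joined to the next 3’
site along the RNA) amounts to a simple pairing rule, sketched below. The
coordinates are hypothetical nucleotide positions, and the function name is
an assumption made for this illustration.

```python
from bisect import bisect_right


def pair_splice_sites(donors, acceptors):
    """Pair every 5' (donor) site with the nearest downstream 3' (acceptor) site."""
    acceptors = sorted(acceptors)
    pairs = []
    for donor in sorted(donors):
        index = bisect_right(acceptors, donor)  # first acceptor after the donor
        if index < len(acceptors):
            pairs.append((donor, acceptors[index]))
    return pairs


# Hypothetical positions of splice sites along a precursor RNA.
DONORS = [100, 400, 900]
ACCEPTORS = [250, 700, 1200]
print(pair_splice_sites(DONORS, ACCEPTORS))
```

Under this rule no intron can span another donor-acceptor pair, so the
introns are removed without skipping any exon.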


The splicing sites are generic; they do not
have specificity for individual RNA precursors
and, besides, the apparatus (spliceosome)
for splicing is not tissue-specific. The
RNA may be spliced by any cell, and the 
conformation of the RNA will influence 
the accessibility of the splicing sites. The 
reaction does not proceed sequentially 
along the precursor, and the RNA splicing 
is also independent of any modifications 
to the RNA. In vitro, a cut is first made 
at the 5’ end of the intron separating 
the left exon and the right intron-exon
molecule. In this case, the left exon takes
the form of a linear molecule, whereas the
right intron-exon is not a linear molecule.
The 5’ terminus generated at the left end
of the intron becomes linked by a 5’-2’
bond to the A in the branch site located 
30 nucleotides upstream of the 3’ end of 
the intron. This linkage keeps the intron 
in the form of a structure called a “lariat.” 
Subsequently, cutting at the 3’ end releases 
the free intron in a lariat form, while the 
right exon becomes ligated with the left 
exon. The lariat is then debranched to
provide a linear excised intron, which is
rapidly degraded [1, 53]. 
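
The two-step reaction described above can be followed on a toy precursor.
The sketch below only tracks which nucleotides end up in which product; it
does not model chemistry. The sequence, the coordinates, and the function
name are hypothetical, and the branch A is placed 30 nucleotides from the
3’ end of the intron to match the distance mentioned in the text.

```python
def splice(pre_mrna: str, five_ss: int, three_ss: int, branch: int):
    """Remove one intron in two steps and report the products.

    five_ss: index of the first intron base (the G of GU)
    three_ss: index of the first base of the right exon (just after AG)
    branch: index of the branch-site A inside the intron
    """
    if pre_mrna[five_ss:five_ss + 2] != "GU":
        raise ValueError("intron must start with GU")
    if pre_mrna[three_ss - 2:three_ss] != "AG":
        raise ValueError("intron must end with AG")
    if pre_mrna[branch] != "A" or not five_ss < branch < three_ss - 2:
        raise ValueError("branch site must be an A inside the intron")

    # Step 1: cut at the 5' site; the intron's 5' end joins the branch A (5'-2').
    left_exon = pre_mrna[:five_ss]
    lariat_loop = pre_mrna[five_ss:branch + 1]   # closed loop of the lariat
    # Step 2: cut at the 3' site and ligate the two exons.
    lariat_tail = pre_mrna[branch + 1:three_ss]  # linear tail of the lariat
    right_exon = pre_mrna[three_ss:]
    return left_exon + right_exon, lariat_loop, lariat_tail


EXON_1, EXON_2 = "AUGGCC", "GGAUAA"
INTRON = "GU" + "C" * 8 + "A" + "U" * 27 + "AG"  # branch A is 30 nt from the 3' end
PRE_MRNA = EXON_1 + INTRON + EXON_2

mrna, loop, tail = splice(PRE_MRNA, five_ss=6, three_ss=46, branch=16)
print("mRNA:", mrna)
print("lariat loop length:", len(loop), "tail length:", len(tail))
```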


4.2 
Splicing Pathways 


Whilst several methods of RNA splicing 
occur in Nature, the type of splicing will 
depend on the structure of the spliced 
intron and the catalysts required for 
splicing. 


4.2.1 Spliceosomal Introns 
Spliceosomal introns often reside in eukaryotic protein-coding genes.
Within the intron, a 5’ splice site, a branch site, and a 3’ splice
site are required for splicing. The 5’ splice site (or splice donor
site) includes an almost invariant sequence, GU, at the 5’ end of the
intron, within a larger, less highly conserved consensus region. The
3’ splice site (or splice acceptor site) terminates the intron with an
almost invariant AG sequence. Upstream from the AG, there is a region
enriched in pyrimidines (C and U), the polypyrimidine tract. Upstream
from the polypyrimidine tract is the branch point, which includes an
adenine nucleotide. Point mutations in the underlying DNA, or errors
during transcription, can activate a “cryptic splice site” in a part
of the transcript that usually is not spliced. This results in a
mature mRNA that is missing a section of an exon. In this way, a point
mutation, which would usually affect only a single amino acid, can
manifest as a deletion in the final protein.

(a-d) Different types of RNA-catalyzed intron splicing reactions.
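
The sequence signals described above (GU at the 5’ end, AG at the 3’ end, a pyrimidine-rich tract upstream of the AG, and an adenine further upstream) can be checked with a short Python sketch. The function name `check_intron_signals`, the 20-nucleotide window, the 60% pyrimidine threshold, and the example intron are all assumptions made for illustration; the text gives no numerical thresholds, and real splice-site prediction uses far richer models.

```python
def check_intron_signals(intron, tract_window=20, min_pyrimidine_fraction=0.6):
    """Report which canonical spliceosomal signals an RNA intron carries.

    intron -- intron sequence in the RNA alphabet (A, C, G, U), 5' to 3'
    tract_window, min_pyrimidine_fraction -- illustrative thresholds only
    """
    seq = intron.upper()
    # Region immediately upstream of the terminal AG dinucleotide.
    tract = seq[-(tract_window + 2):-2]
    pyrimidines = sum(base in "CU" for base in tract)
    fraction = pyrimidines / len(tract) if tract else 0.0
    # Anything between the 5' GU and the tract may hold the branch-point A.
    upstream_of_tract = seq[2:-(tract_window + 2)]
    return {
        "donor_GU": seq.startswith("GU"),
        "acceptor_AG": seq.endswith("AG"),
        "pyrimidine_rich_tract": fraction >= min_pyrimidine_fraction,
        "candidate_branch_A": "A" in upstream_of_tract,
    }


# Hypothetical intron: GU... / branch region with an A / 20-nt C+U tract / AG
example_intron = "GUAAGU" + "GGCACUAACGG" + "UCUUCCUUUCUCCUUUUCCU" + "AG"
report = check_intron_signals(example_intron)
print(report)

# A single base change at the donor site removes the GU signal.
mutant_report = check_intron_signals("GA" + example_intron[2:])
print(mutant_report["donor_GU"])
```

A mutation that destroys one of these signals, or creates a matching signal elsewhere, is the kind of change that can shift splicing to a cryptic site.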


4.2.2 Spliceosome Formation and Activity 
Many small RNAs have been found in the nucleus and cytoplasm of
eukaryotic cells; these are referred to as small nuclear RNAs (snRNAs)
and small cytoplasmic RNAs (scRNAs), respectively. They exist as
ribonucleoprotein particles (snRNPs and scRNPs), which are
colloquially known as “snurps” and “scyrps.” A snRNP generally
contains one snRNA and about 10 proteins, some of which are common to
all snRNPs, while others are unique to a particular snRNP. The common
proteins are recognized by an autoimmune antiserum (anti-Sm), and are
considered to be involved in the autoimmune reaction. Many snRNPs are
involved in RNA splicing. The snRNAs present in these snRNPs have
sequences complementary to the 5’ or 3’ splicing sites, or to the
branching sequence. It is considered that base-pairing between snRNA
and HnRNA, or between snRNAs, plays an important role in splicing.

The spliceosome consists of many snRNPs and many additional proteins
that are often referred to as splicing factors (Fig. 11). The snRNPs
are U1, U2, U5, and U4/U6, and are named according to the snRNAs that
they contain. Together, the snRNPs present in the spliceosome
incorporate about 40 proteins, some of which may be directly involved
in splicing, while others may have structural roles in assembly or in
interaction with the snRNPs.

In the U1 snRNP, the 5’-terminal 11 nucleotides of the snRNA are
single-stranded and include a stretch that is complementary to the
consensus sequence at the 5’ splice site (the 3’ end of the exon);
this region is considered to be directly involved in splicing. The
intact U1 snRNP can bind to a 5’ splicing site in vitro, whereas the
U1 snRNA alone cannot. The U1 snRNP first binds at the 5’ splice site,
and then also binds to the branch site, although how it recognizes the
branch site is not known. The U2 snRNA binds to the branch site
through a base-pairing interaction; however, prior binding of the U1
snRNP is essential for the binding of the U2 snRNP. Although
interaction with the U1 snRNP is responsible for recognizing the
splicing site, this does not control the cleavage. Initially, the U5
snRNA binds close to exon sequences at the 5’ splice site, but it then
changes its position to the vicinity of the intron. Based on the
results obtained, it has been suggested that the snRNA components of
snRNPs interact both among themselves and with the substrate RNA by
base-pairing, and that these interactions allow for changes in
structure that may bring reacting groups into apposition, thereby
creating a catalytic center [54, 55].

A series of loci containing genes that potentially code for splicing
factors were originally thought to code for RNA, but are now known to
encode pre-mRNA processing proteins (PRPs). Some of the PRPs are
components of snRNPs, while others may function as independent
factors. One protein, PRP16 (an ATP-dependent helicase), has been
shown to be involved in the second catalytic step of RNA splicing.
Another protein, PRP22 (also an ATP-dependent helicase), has been
shown to be required to release mRNA from the spliceosome [56, 57].


4.2.3 Self-Splicing 
Self-splicing occurs rarely in RNA; an RNA of this type is referred to
as a ribozyme. Two types of self-splicing intron have been identified,
termed Group I and Group II. These introns perform splicing in a
manner similar to the spliceosome, but without any requirement for
protein. Such similarity indicates that the Group I and II introns may
be evolutionarily related to the spliceosome. Self-splicing may have
existed in an “RNA world” that was present before protein. Although
tRNA splicing requires other enzymes (viz. endonuclease and RNA
ligase), it has been shown that the RNA part of ribonuclease P (an
enzyme that has RNA in its structure) can, on its own, cut the
pre-tRNA molecule at a specific site. In general, splicing is a
cis-reaction, although trans-splicing has also been reported [1, 4].

Pre-mRNA splicing using the spliceosome. Spliceosome formation
involves the interaction of components that recognize the consensus
sequences. U1, U2, U3, U4, U5, and U6 are different small nuclear
ribonucleoproteins.


Group I and Group II Introns
Group I introns (where the hydroxyl group is provided by a free
guanine nucleotide) are more common than Group II introns (where the
hydroxyl group is provided by an internal 2’-OH position). Both Group
I and Group II introns can perform the splicing by themselves, without
a need for enzymic activities to be provided by proteins. The Group II
mitochondrial introns have splicing sites that resemble those of the
nuclear introns, and they are also spliced by the same mechanism as
nuclear HnRNA. Group II introns perform two transesterification
reactions; as the number of phosphodiester bonds is conserved in the
reaction, there is no need for an external energy supply, which may
have been an important feature in the evolution of splicing. In
autocatalytic splicing, the RNA folds into a specific conformation or
series of conformations, and splicing occurs in cis. In contrast, the
snRNAs act in trans upon HnRNA.

Cech and colleagues, working with Tetrahymena thermophila, were the
first to show that RNA molecules are capable of self-splicing without
the involvement of any protein. Having shown that RNA could indeed
catalyze its own splicing, Cech et al. coined the term ribozyme,
meaning RNA as an enzyme. In the self-splicing of the T. thermophila
RNA (as shown in Fig. 12), however, the RNA acts on itself, whereas
enzymes act on molecules other than themselves. The same group later
showed that the ribozyme could act on a slightly different form of the
same RNA and was, therefore, an enzyme in the true sense. It was also
suggested that, because RNA can serve as both a catalyst and an
informational molecule, RNA may have functioned alone, in the absence
of DNA or proteins, at the time when life on Earth first began [4].


4.2.4 tRNA Splicing 

Not all of the genes that code for tRNAs have noncoding sequences
(introns) in their structures. In fact, only about 40 of almost 400
nuclear tRNA gene products in yeast are known to be interrupted, each
by a single intron located just one nucleotide beyond the 3’ side of
the anticodon. The size of these introns varies from 14 to 46
nucleotides, and no consensus sequence has been found within them. RNA
splicing in the primary transcript of the tRNA gene occurs in a
different fashion, there being separate cleavage and ligation
reactions (Fig. 13). The same mode of splicing as occurs in yeast has
also been reported in the nuclear tRNA gene products of plants,
amphibians, and mammals. All of the introns in the tRNA gene products
have a sequence that is complementary to the anticodon of the tRNA;
this creates an alternative conformation for the anticodon arm, in
which the anticodon is base-paired to form an extension of the usual
arm. The splicing of a tRNA gene product depends primarily on the
recognition of a common secondary structure in the tRNA, as no common
sequences within the introns have been reported to date.

In tRNA gene product splicing, the phosphodiester bond is cleaved by
an endonuclease, and the hydrolysis of ATP is not required as an
energy source. Subsequently, an enzyme, RNA ligase, is required for
bond formation, and this ligase-catalyzed reaction requires energy via
the hydrolysis of ATP. The generation of a 2’,3’-cyclic phosphate bond
also occurs during splicing in plants and mammals. Cleavage of the
phosphodiester bond by the endonuclease first generates 2’,3’-cyclic
phosphate and 5’-OH termini, after which the cyclic phosphate is
opened to form 3’-OH and 2’-phosphate groups, and the 5’-OH is
phosphorylated. Following release of the intron, the tRNA
half-molecules are folded into a tRNA-like structure that has a 3’-OH
and a 5’-phosphate at the break, which is sealed by the action of RNA
ligase.

Ribonuclease P, a tRNA-processing enzyme that is found in both
bacteria and eukaryotes, is a ribonucleoprotein. It was initially
thought that both the RNA and protein components are required for the
nuclease to cut the tRNA precursor at a specific point. However, it
was shown subsequently that the RNA part of ribonuclease P alone can
cut the pre-tRNA molecule at a specific site. In contrast, the protein
part of the enzyme alone could not do this, which indicates that RNA
catalyzes the cleavage of the pre-tRNA molecule [4, 58-60].

Fig. 12 Self-cleavage of the rRNA intron. The intron of pre-rRNA of
Tetrahymena is cleaved by autocatalytic splicing. (i) Folding of RNA
to act as ribozyme; (ii) A hydroxyl group attached to GTP attacks the
5’ end phosphate of the intron, such that the phosphodiester bond
between exon and intron is broken and a new bond is formed between the
guanine nucleotide and the intron; (iii) The hydroxyl group at the
left exon attacks at the 3’ end of the intron. The bond is broken and
the exons are ligated, releasing the intron; (iv) A similar reaction
enables the intron to form a circle, snipping 15 nucleotides from its
end in the process. The circle opens into a linear molecule, and then
closes with the removal of four nucleotides. The final open form is
called L-19 IVS (linear minus 19 intervening sequence).


Fig. 13 Splicing of tRNA. This involves cleavage of the exon-intron
boundaries by endonuclease to generate 2’,3’-cyclic phosphate and
5’-OH termini. The cyclic phosphate is opened to generate 3’-OH and
2’-phosphate groups. The 5’-OH is phosphorylated and, after release of
the intron, the tRNA half-molecules fold into a tRNA-like structure
that now has a 3’-OH and a 5’-phosphate, which are then joined by RNA
ligase.


4.3
cis- and trans-Splicing Reactions


Splicing generally occurs as an intramolecular cis-reaction in which a
controlled deletion of the introns takes place. Removal of the introns
from the RNA molecule allows the exons of that molecule to be spliced
together. Intermolecular splicing also occurs, whereby exons present
in different RNA molecules are spliced (ligated) into one molecule;
these reactions, which are referred to as trans-splicing (Fig. 14),
are rare and never occur between pre-mRNA transcripts of the same
gene.

Trans-splicing occurs in vivo under certain special conditions. In
trypanosomes, a 35-nucleotide leader sequence is present at the 5’ end
of many mRNAs; the RNA that provides it, which is known as a spliced
leader (SL) RNA, donates the 5’ exon required for trans-splicing. The
SL RNAs that are found in certain species of trypanosomes and
nematodes have common features; notably, they fold into a common
secondary structure that has three stem loops and a single-stranded
region, resembling snRNPs. Typically, trypanosomes possess the U2, U4,
and U6 snRNAs, but not the U1 and U3 snRNAs. The SL RNA functions
without recognition of the 3’ splicing site, and depends directly on
RNA.

Some chloroplast genes are also trans- 
spliced. For example, the psa gene of 
the Chlamydomonas chloroplast has three 
widely separated exons, with Exon 1 being 
located 50 kb away from Exon 2, and 
Exon 2 being 90 kb away from Exon 3. 
Although many other genes lie between
these exons, the exons cannot be transcribed
as a common transcript, since Exon 1 is
in reversed orientation from Exon 2 and
Exon 3. In addition, as several other genes
are required for one or other of the
trans-splicing reactions, the process of splicing
this mRNA together is quite complex [1].


4.4 
Alternate Splicing 


When a single gene provides more than one mRNA sequence, the situation
is referred to as alternate splicing. In some cases, the use of a
different start point (5’ splicing site) and/or 3’ splice site will
alter the pattern of splicing although, as noted above, this occurs
only when the dinucleotide sequences are present at the ends of the
introns. As a result, it is possible that the same portion of the gene
may act as an exon in one mRNA, and as an intron in another mRNA.
Alternate splicing also occurs through the substitution, addition, or
deletion of internal exons. For example, if a gene has exons 1 to 8,
these may be ligated in different ways, such as 1, 2, 3, 4; 2, 3, 5,
7; 1, 2, 4, 6; 1, 2, 3, 7; and 1, 3, 5, 6. These multiple products may
be created in the same cell, but in other cells the process may be
regulated so that particular splicing patterns occur only under
certain conditions. In some cases, proteins that intervene to bias the
use of alternate splice sites have been identified. In the recently
sequenced human genome, the number of genes has been estimated to be
in the range of 25 000, whilst the total number of proteins at
different stages of development has been estimated at almost 500 000.
This situation has been explained on the basis of alternate splicing,
it having been estimated that, under different conditions, almost 20
proteins may be coded from one gene.

Fig. 14 Schematic diagram showing cis-splicing and trans-splicing
reactions.
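
The exon combinations listed above, and the arithmetic behind the figure of almost 20 proteins per gene, can be made concrete with a small Python sketch. The function name `splice_isoform` and the placeholder exon labels are hypothetical; only the five exon patterns and the two genome-wide estimates are taken from the text.

```python
def splice_isoform(exons, pattern):
    """Join the selected exons, in the order given, into one mature mRNA."""
    return "".join(exons[number] for number in pattern)


# Placeholder exon "sequences": each exon is just a labelled token.
exons = {number: f"[E{number}]" for number in range(1, 9)}

# The five exon combinations listed in the text.
patterns = [
    (1, 2, 3, 4),
    (2, 3, 5, 7),
    (1, 2, 4, 6),
    (1, 2, 3, 7),
    (1, 3, 5, 6),
]
isoforms = {pattern: splice_isoform(exons, pattern) for pattern in patterns}
for pattern, mrna in isoforms.items():
    print(pattern, "->", mrna)

# Rough arithmetic behind the estimate of proteins per gene.
estimated_genes = 25_000
estimated_proteins = 500_000
proteins_per_gene = estimated_proteins / estimated_genes
print(proteins_per_gene)  # 20.0
```

Each pattern yields a distinct mRNA from the same eight exons, which is how one gene can account for several proteins.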

In Drosophila melanogaster, native splic- 
ing may be caused by mutations in the 
genes. In the case of the SV40 T/t antigens,
use of the 5’ site for the T antigen removes a
termination codon that is present in the t antigen
mRNA, which is why the T antigen is
much larger than the t antigen. In adenovirus E1A
transcripts, one of the 5’ sites is connected
to the last exon in a different open reading
frame, which then causes a change in the
C-terminal region of the protein.

Drosophila flies with one X chromosome and two sets of autosomes (A)
are male, while those with an equal number of X chromosomes and sets
of autosomes are female. The X : A ratio activates the so-called Sxl
gene, which exerts a positive control on its own expression as well as
that of three other genes. This provides a mechanism so that the
single X chromosome of males is transcribed into as much RNA as the
two X chromosomes of females. The Sxl gene produces either male- or
female-specific spliced transcripts, which have identical 5’ ends but
differ in the presence or absence of a small male-specific exon that
inserts a stop codon into the transcript. The protein encoded by the
Sxl gene has an 80-amino acid RNA-binding domain; the same motif has
been reported in many other RNA-binding proteins, and perhaps provides
a clue as to the control of its own processing and that of further
proteins in the regulatory cascade.
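
The effect of the male-specific exon can be illustrated with a toy Python model. Everything in the sketch is hypothetical: the exon sequences are invented placeholders (not real Sxl sequence), and the function names are assumptions made for this example. It shows only the logic stated above, namely that the X : A ratio distinguishes the sexes and that including an exon with an in-frame stop codon shortens the reading frame.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

# Hypothetical toy exons, each a whole number of codons (not real Sxl sequence).
EXON_5P = "AUGGCUGCU"      # shared 5' end: AUG GCU GCU
MALE_EXON = "GGAUAAGGA"    # male-specific exon with an in-frame UAA stop
EXON_3P = "GCUGCUGCUUGA"   # downstream exons, ending in a stop codon


def sex_from_x_to_a(x_chromosomes, autosome_sets):
    """Classify by the X : A ratio for the two cases given in the text."""
    if x_chromosomes == autosome_sets:
        return "female"
    if x_chromosomes == 1 and autosome_sets == 2:
        return "male"
    return "not covered here"


def sxl_transcript(sex):
    """Assemble the toy transcript; only males include the extra exon."""
    parts = [EXON_5P]
    if sex == "male":
        parts.append(MALE_EXON)
    parts.append(EXON_3P)
    return "".join(parts)


def codons_before_stop(mrna):
    """Count codons from the first AUG up to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return 0
    count = 0
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            break
        count += 1
    return count


for x, a in [(1, 2), (2, 2)]:
    sex = sex_from_x_to_a(x, a)
    print(sex, codons_before_stop(sxl_transcript(sex)))
```

With these placeholder exons the male transcript is longer but encodes a shorter product, because translation stops inside the male-specific exon.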

Another important example of alternate splicing is provided by the
tropomyosin genes of Drosophila and vertebrates. The tropomyosins are
a family of closely related proteins that mediate the interactions
between actin and troponin, and help in the regulation of muscle
contraction. Different tissues, both muscle and non-muscle, are
characterized by the presence of different tropomyosin isoforms. It is
generally considered that many of these isoforms are produced from the
same gene, via alternate splicing [1].


5 
Role of microRNAs (miRNAs) in the 
Regulation of Gene Expression 


During recent years, it has been argued that a majority of RNA
molecules represent the principal actors in the largely unexplored
networks of gene regulation. Indeed, it has been suggested that an
understanding of RNA-based gene regulatory networks might provide the
key to explaining the difference between a yeast cell and a fruitfly,
and/or between a fruitfly and a human. According to John Mattick
(University of Queensland in Brisbane, Australia), complexity is
hidden in the noncoding output of the genome. Recently, a new class of
noncoding RNAs has been reported, the microRNAs (miRNAs), which have
been predicted to regulate the production of proteins from other
genes. The genomes of higher organisms may consist of up to almost 98%
noncoding DNA sequence, much of which is never read at all, although
some of it may be transcribed into RNA; for example, genes contain
noncoding introns between the exons. When a HnRNA is transcribed from
a gene, the introns are cut out and the exons ligated. Many sequences
outside protein-coding genes are also transcribed into RNA [61].
According to Mattick, noncoding RNAs interact with one another, with
mRNAs, with DNA, and also with proteins, to form networks that can
regulate gene activity with almost infinite potential complexity. This
is a very convincing suggestion, as a straightforward comparison of
gene numbers cannot explain the difference between simple and complex
organisms. As humans appear not to have a much larger number of genes
than do simple organisms (e.g., the nematode Caenorhabditis elegans),
it would appear that higher organisms bolster their complexity by
“mixing and matching” the protein domains, so as to generate new
combinations (although other ploys may be required to explain the
complexity of humans and other vertebrates). Mattick compared the
RNA-based networks with a computer, the controlling software of which
allows the processor to be easily reconfigured for a new task by
changing the control codes. Evidence became available that, in human
chromosomes, many more (up to 10-fold) sequences were being
transcribed than had been predicted, and therefore the role(s) of the
noncoding sequences is (are) of great importance. Whilst the noncoding
sequences have been shown to be common, it cannot be proved on this
basis that most are involved in networks of gene regulation.

The subsequent discovery of miRNAs strengthened this concept, however.
The first genes coding for miRNAs, lin-4 and let-7, were identified in
C. elegans; the mature miRNAs have a length of approximately 22
nucleotides. The miRNAs are known to be cut from longer hairpin-shaped
RNAs that are transcribed from lin-4 and let-7, and to bind to
specific target mRNAs, thus blocking their translation into proteins.
Evidence has also been provided for the presence of miRNAs in a
diverse range of species, including vertebrates and plants. An
intriguing link has also been identified to a gene-silencing
mechanism, RNA interference (RNAi), that is considered to defend cells
from viruses and jumping genes. RNAi begins when the cell detects an
unusual RNA with paired strands; an enzyme known as Dicer then cleaves
the offending double-stranded RNA into fragments of 21-25 nucleotides
that are referred to as small interfering RNAs (siRNAs). Single
strands from these fragments then bind to further copies of the
original RNA, targeting them for destruction. RNAi has also been used
experimentally to silence a cell’s own genes by adding double-stranded
RNA sequences that match the gene’s mRNA [62-64].
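
The two steps described above, cleavage of a long RNA into short fragments and binding of a short RNA to a complementary target, can be sketched in Python. This is a deliberately simplified illustration: the function names, the 22-nucleotide example sequence, and the requirement for perfect complementarity are assumptions made here (in cells, miRNA-target pairing is often imperfect), and Dicer is modeled only as cutting one strand into consecutive equal-length pieces.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the sequence that base-pairs with `rna`, written 5' to 3'."""
    return rna.translate(COMPLEMENT)[::-1]


def dice(strand, fragment_length=22):
    """Cut one strand of a long duplex into consecutive short fragments."""
    if not 21 <= fragment_length <= 25:
        raise ValueError("the text gives fragment sizes of 21-25 nucleotides")
    last_start = len(strand) - fragment_length
    return [strand[i:i + fragment_length]
            for i in range(0, last_start + 1, fragment_length)]


def find_target_sites(small_rna, mrna):
    """Return start positions in `mrna` that pair perfectly with `small_rna`."""
    site = reverse_complement(small_rna)
    positions = []
    start = mrna.find(site)
    while start != -1:
        positions.append(start)
        start = mrna.find(site, start + 1)
    return positions


# Hypothetical 22-nt small RNA and a toy mRNA carrying one matching site.
small_rna = "AUGGCUAGCUAGGCUUAACGGA"
mrna = "CCCC" + reverse_complement(small_rna) + "GGGG"
print(find_target_sites(small_rna, mrna))

fragments = dice("A" * 50)
print([len(fragment) for fragment in fragments])
```

Any leftover bases shorter than one fragment are simply dropped in this sketch, which is a modeling shortcut rather than a statement about the enzyme.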

miRNAs have been considered to be regulators in all types of
biological systems, one such role being to convert proliferating
oligodendrocyte precursor cells into mature, myelinating
oligodendrocytes. As noted above, Dicer 1 was found to be involved in
the processing of larger RNA precursors into smaller, active, 20- to
24-nucleotide miRNAs, and subsequent knockout studies conducted with
Dicer 1 in an oligodendrocyte precursor cell lineage in mice led to
the creation of animals without myelin. Along the same lines, attempts
have been made to correlate miRNAs with the demyelination of
dendrocytes, as occurs in Alzheimer’s disease [64-66].


6 
Chromatin Structure and the Control of 
Gene Expression 


As noted above, two forms of chromatin 
structure have been identified, namely het- 
erochromatin and euchromatin, the origi- 
nal designation being based on cytological 
observations of how darkly the two regions 
could be stained. Heterochromatin is more 
densely packed than euchromatin, is often 
located close to the centromeres of the 
chromosomes, and is generally transcrip- 
tionally inactive. In contrast, euchromatin 
is more loosely packed and transcription- 
ally active. 

Whilst it is possible to predict the 
transcriptionally active regions of chro- 
matin, based on cytological assays, more 
modern investigations have defined the
molecular basis for chromatin structure
in the context of the regulation of
gene expression. Two primary mecha- 
nisms exist that alter chromatin struc- 
ture and, consequently, affect gene ex- 
pression: (i) the methylation of cytidine 
residues in the DNA, located in the din- 
ucleotide CG (this is most often written 
as a CpG dinucleotide); and (ii) histone 
modification(s). 

While previous observations have suggested that over 90% of methyl-C
is located in the dinucleotide CpG, not all CpG dinucleotides have a
methylated C residue. It has also been shown that the promoter regions
of genes contain 10- to 20-fold more CpGs than the remainder of the
genome. In general, there is an inverse relationship between
methylation and transcription; for example, when cells undergo
differentiation, the transcriptionally active genes have been shown to
exhibit a reduction in methylation level compared to that prior to
activation, and such under-methylation persists after the
transcription has ceased. The role of DNA methylation in the control
of gene transcription was first demonstrated by treating cells in
culture with the cytidine analog 5-azacytidine (5-azaC), which has a
nitrogen instead of a carbon at position 5 of the pyrimidine ring and
so cannot serve as a substrate for methylation. When fibroblasts were
grown in the presence of 5-azaC they differentiated into myoblasts,
and such differentiation was shown to have resulted from an
under-methylation and activation of the MyoD gene (a master regulator
of muscle differentiation).

The methylation of DNA is catalyzed by 
several different DNA methyltransferases. 
The critical role of DNA methylation in 
controlling developmental fate was ob- 
served in mice by inactivating either DNA 
methyltransferase 3a or 3b, whereby the 
loss of either gene resulted in animal 
death shortly after birth. When cells divide, 
the newly formed DNA will contain one 
strand of parental DNA, and one newly 
replicated DNA strand. However, if the 
DNA contains methylated cytidines in the 
CpG dinucleotides, then the newly repli- 
cated DNA strand should be methylated in 
order to maintain the parental pattern of 
methylation. Such “maintenance” methy- 
lation is catalyzed by DNA methyltrans- 
ferase 1 (also referred to as maintenance 
methylase). 
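
The logic of maintenance methylation can be sketched in Python. The function names and the short example sequence are hypothetical, and positions on the new strand are reported in the coordinates of the parental strand for simplicity. The sketch relies on the fact that CpG is self-complementary: opposite the G of a parental CpG lies the C of a CpG on the new strand, so a methylated parental CpG marks exactly which cytosine on the new strand should be methylated.

```python
def cpg_sites(strand):
    """Return the index of the C in every CpG dinucleotide (read 5' to 3')."""
    return [i for i in range(len(strand) - 1) if strand[i:i + 2] == "CG"]


def maintenance_methylation(parental_strand, parental_methyl_c):
    """Copy the parental CpG methylation pattern onto the new strand.

    parental_methyl_c -- indices of methylated cytosines on the parental strand
    Returns the parental-strand coordinates whose paired base on the new
    strand is a cytosine that should be methylated.
    """
    new_strand_methyl_c = set()
    for i in sorted(parental_methyl_c):
        # Only hemimethylated CpG sites are copied; other positions are ignored.
        if parental_strand[i:i + 2] == "CG":
            # The C of the new strand's CpG pairs with the parental G at i + 1.
            new_strand_methyl_c.add(i + 1)
    return new_strand_methyl_c


parental = "ACGTTCGACG"      # hypothetical sequence with three CpG sites
methylated = {1, 8}          # the CpG at index 5 is left unmethylated
print(cpg_sites(parental))
print(sorted(maintenance_methylation(parental, methylated)))
```

The unmethylated CpG stays unmethylated on the new strand, so the parental pattern, including its gaps, is what gets inherited.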

Today, many proteins have been identified that bind to methylated, but
not unmethylated, CpGs. One such protein is methyl-CpG-binding protein
2 (MeCP2) which, when bound to methylated CpG dinucleotides, causes
the DNA to take on a closed chromatin structure, with the subsequent
repression of transcription. The ability of MeCP2 to bind methylated
CpGs is controlled by its phosphorylation state. Phosphorylated MeCP2
has a lower affinity for methylated CpGs, and its weaker binding leads
to the DNA acquiring a more open chromatin state. The importance of
MeCP2 in regulating chromatin structure and, consequently, the
transcription process, has been confirmed by the fact that a
deficiency in this protein results in Rett syndrome. This
neurodevelopmental disorder occurs almost exclusively in females, and
manifests as mental retardation, seizures, microcephaly, arrested
development, and loss of speech.

Those histone proteins that remain 
bound to DNA also undergo a number 
of modifications that affect the chromatin 
structure. In fact, it has been shown 
that if the histone is acetylated then the 
chromatin structure will be more open, 
and such modified histones will be lo- 
cated in regions of transcriptionally active 
chromatin. A direct correlation between 
histone acetylation and transcriptional ac- 
tivity has been confirmed by the fact 
that protein complexes, known previously 
as transcriptional activators, demonstrate 
histone acetylase activity, whereas tran- 
scriptional repressor complexes possess 
histone deacetylase activity. Other proteins
that interact with acetylated lysines in
histones help to form a more open chromatin
structure. Those proteins that bind
to acetylated histones incorporate a so-called
bromodomain, which is composed of
a bundle of four α-helices and is involved
in protein-protein interactions in several
cellular systems, in addition to acetylated
histone binding and chromatin structure
modification.


Both acetylation and methylation of
histones have been shown to affect chromatin
structure, although no direct correlation
between histone methylation and
a specific effect on transcription has 
yet been observed. The methylation of 
histone H4 on arginine at position 4 
promotes an open chromatin structure, 
and consequently accelerates transcrip- 
tional activation. The methylation of hi- 
stone H3 on lysine at positions 4 and 
79 has also been shown to acceler- 
ate transcriptional activation. In contrast, 
the methylation of histone H3 on ly- 
sine at positions 9 and 27 has been 
shown to result in transcriptionally inac- 
tive genes. The binding of some specific 
proteins on methylated histones may re- 
sult in the formation of a more compact 
chromatin. Those proteins that bind to 
methylated lysines present in histones 
incorporate a so-called “chromodomain”, 
which consists of a conserved stretch 
of 40-50 amino acids and is found 
in many proteins involved in chromatin 
remodeling complexes. Chromodomain 
proteins are also found in the RNA- 
induced transcriptional silencing (RITS) 
complex, which involves the siRNA- 
and miRNA-mediated downregulation of 
transcription. 

The histone proteins may also be mod- 
ified by binding a small protein, ubiq- 
uitin, though this occurs only with hi- 
stones H2A and H2B (typically, only a 
small percentage of histone H2A is ubiq- 
uitinated). Whilst ubiquitinated H2A is 
involved in the repression of transcrip- 
tion, ubiquitinated histone H2B causes 
the stimulation of gene expression. The 
ubiquitinated histone H2B has also been
shown to promote the methylation of histone
H3 on lysine at positions 4 and
79 such that, in turn, the methylated
histone H3 promotes an open chromatin
structure.
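
The histone modifications described in the preceding paragraphs can be collected into a small lookup, sketched here in Python. The table simply restates the effects given in the text; the function name `transcriptional_effect`, the two-word outcome labels, and the "not described in the text" fallback are assumptions made for this example, and the sketch ignores combinations of marks.

```python
# Effect of histone methylation by (histone, residue position), as stated above.
METHYLATION_EFFECT = {
    ("H4", 4): "activation",    # arginine at position 4
    ("H3", 4): "activation",    # lysine
    ("H3", 79): "activation",   # lysine
    ("H3", 9): "repression",    # lysine
    ("H3", 27): "repression",   # lysine
}

# Effect of ubiquitination by histone, as stated above.
UBIQUITINATION_EFFECT = {"H2A": "repression", "H2B": "activation"}

UNKNOWN = "not described in the text"


def transcriptional_effect(histone, modification, position=None):
    """Look up the effect on transcription reported for a histone mark."""
    if modification == "acetylation":
        # Acetylated histones are associated with open, active chromatin.
        return "activation"
    if modification == "ubiquitination":
        return UBIQUITINATION_EFFECT.get(histone, UNKNOWN)
    if modification == "methylation":
        return METHYLATION_EFFECT.get((histone, position), UNKNOWN)
    return UNKNOWN


print(transcriptional_effect("H3", "methylation", 4))
print(transcriptional_effect("H3", "methylation", 27))
print(transcriptional_effect("H2A", "ubiquitination"))
print(transcriptional_effect("H3", "methylation", 36))
```

Keying methylation by residue position reflects the point made above that methylation has no single effect: the outcome depends on which residue carries the mark.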

The phosphorylation of histones has also been reported, in response to
outside signals such as growth factor stimulation, or stress inducers
such as heat shock. The binding of phosphorylated histones causes the
genes to become transcriptionally active, an effect that becomes
apparent in patients with Coffin-Lowry syndrome, a disease which
results from defects in the RSK2 gene that encodes the
histone-phosphorylating enzyme. Coffin-Lowry syndrome, a rare form of
X-linked mental retardation, is characterized by skeletal
malformations, growth retardation, hearing deficit, paroxysmal
movement disorders, and cognitive impairment in affected males [29].


7 
Epigenetic Control of Gene Expression 


Originally, the term “epigenetics” was coined by Conrad Waddington in
1939, to define the unfolding of the genetic program during
development. In addition, the term epigenotype was coined to define
“... the total developmental system consisting of interrelated
developmental pathways through which the adult form of an organism is
realized.” Nowadays, the term epigenetics is used to define the
mechanism by which changes in the pattern of inherited gene expression
occur in the absence of any alterations or changes in the nucleotide
composition of a given gene. Epigenetics can also be explained as
being “... in addition to changes in genome sequence.”

It may help to explain epigenetics through the example of a fertilized
egg which, at the moment of fertilization, is totipotent; that is, as
the egg divides, the daughter cells will ultimately differentiate into
all of the different cells of the organism. The only differences
between the various cells of the resultant organism are the
consequences of differential gene expression; they are not due to any
differences in the sequences of the genes themselves.

To date, several different types of epi- 
genetic event have been identified, among 
which DNA methylation is likely to be 
the most important for controlling and 
maintaining the pattern of gene expres- 
sion during development. Other DNA- 
modifying events that are also known to 
affect the epigenetic phenomenon include 
the acetylation, methylation, phosphoryla- 
tion, ubiquitylation, and sumoylation of 
histone proteins. Consequently, the same 
events that affect chromatin structure can 
be considered also as epigenetic events. 
Notably, the control of gene expression 
by siRNAs is also considered to be an 
epigenetic event. 

Epigenesis plays an important role in 
the regulation and maintenance of gene 
expression, and may result in many dif- 
ferentiation states of cells within an or- 
ganism. Recently acquired evidence has 
demonstrated a connection between epi- 
genetic processes and diseases, the most 
significant of which is the link between epi- 
genesis and cancer (epigenesis has been 
suggested as a contributory factor in many 
different types of cancer). In particular, 
a correlation has been observed between 
changes in the methylation status of the 
tumor suppressor genes and the develop- 
ment of many types of cancer. Epigenetic 
effects on immune system function have 
also been reported, as has a correlation 
between epigenetic processes and mental 
health [29]. 


8 
Gene Regulation by Hormonal Action 


It has been shown that signals originate 
from various glands and/or secretory cells 
that stimulate the target tissues or cells 
to carry out dramatic changes in their 
metabolic patterns, including altered pat- 
terns of differentiation. As peptide hor- 
mones are generally larger molecules, and 
are generally unable to enter the cell, they 
exert their effects by binding to cell-surface 
receptors, with subsequent activation of 
the protein enzyme transcription factors 
via a mechanism of phosphorylation. 

In contrast, steroid hormones (e.g., 
estrogens) are smaller molecules that can 
readily penetrate the plasma membrane. 
Following entry, these molecules become 
tightly bound to specific receptor proteins 
that are present only in the cytoplasm of 
the target cells. 

Hormone-receptor protein complexes 
may activate the transcription of specific 
genes in two different ways: 


• The hormone-receptor protein complex
  activates the transcription of target
  genes by binding to specific DNA
  sequences present in the cis-acting
  regulatory regions of genes.

• The hormone-receptor protein complex
  interacts with specific nonhistone
  chromosomal proteins, after which the
  complex stimulates transcription of the
  correct genes.


In the past, it has been considered 
that nonhistone chromosomal proteins 
play an important role in the regula- 
tion of gene expression in eukaryotes. 
However, further evidence suggests that 
the hormone-receptor protein complexes 
may activate gene expression by interacting
directly with specific DNA sequences
present within the enhancer or promoter
regions that regulate transcription of the
target genes [67, 68].

In addition, the possibility exists that 
histone modifications or nonhistone 
chromosomal proteins are involved in 
some aspects of hormone-regulated gene 
expression. 


9 
Post-Transcriptional Regulation of mRNA 


Although the regulation of gene expres- 
sion in eukaryotes at the level of initiation 
of transcription is considered to be very 
important, regulation at the level of post- 
transcription has also been noted in many 
cases. Although capping at the 5’ end of 
the eukaryotic mRNA is considered essen- 
tial, polyadenylation at the 3’ end has not 
been identified in all mRNAs. Whether 
the inhibition of polyadenylation is used 
specifically to block the expression of par- 
ticular genes is not known, although some 
genes have multiple putative polyadenyla- 
tion sites that may be used for alternate 
splicing (i.e., the formation of more than 
one mRNA from one gene). The choice 
of polyadenylation site may also vary dur- 
ing the development of a cell, with the 
switching of splicing patterns occurring in 
a developmentally significant manner. 
Polyadenylation does not occur at
the extreme 3’ end of the primary transcript;
rather, the transcript is cleaved at an internal
site that lies between 10 and 30 nucleotides
downstream of the polyadenylation
signal, which has the sequence
5’AAUAAA3’, or a variant of it. The
nucleotides beyond this site are removed with the
assistance of an endonuclease, thereby
producing an intermediate 3’ end to which
the polyA tail is subsequently attached
by the enzyme polyA polymerase. For
polyadenylation, there is a requirement for 
a specificity factor that also recognizes the
5’AAUAAA3’ sequence. This specificity 
factor incorporates three subunits which, 
together, will bind specifically to RNA con- 
taining the sequence 5’AAUAAA3’. The 
polyA polymerase first synthesizes an oligo-A
stretch of about 10 residues at the 3’ end of the
mRNA, in the presence of a specificity 
factor; subsequently, this oligo-A tail is 
extended to almost 200 residues in the 
presence of another factor which recog- 
nizes the oligo-A tail and directs polyA 
polymerase to catalyze its extension. As 
noted above, the polyadenylation of mRNA 
is not essential for further translation; 
rather, it is considered that it may affect 
the stability of the mRNA in the cell. The 
polyA tail is associated with a particular 
protein termed the polyA binding protein 
(PABP); it is believed that the binding of 
polyA with PABP is essential to protect 
mRNA against degradation by nucleases 
[1, 69, 70]. 
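
As a rough illustration of how the polyadenylation signal can be located in a transcript, the following Python sketch scans an RNA string for 5’AAUAAA3’ and reports a candidate cleavage window for each hit. The function name, the example sequence, and the default window of 10 to 30 nucleotides downstream of the signal are assumptions made for illustration only; variant signals and the protein factors described above are not modeled.

```python
def find_polya_signals(rna, signal="AAUAAA", window=(10, 30)):
    """Return (signal_start, (window_start, window_end)) for each signal hit.

    Indices are 0-based. The window covers positions window[0]..window[1]
    nucleotides past the end of the signal, clipped to the sequence length.
    """
    rna = rna.upper().replace("T", "U")  # accept DNA-style input as well
    hits = []
    pos = rna.find(signal)
    while pos != -1:
        signal_end = pos + len(signal)
        lo = min(signal_end + window[0], len(rna))
        hi = min(signal_end + window[1], len(rna))
        hits.append((pos, (lo, hi)))
        pos = rna.find(signal, pos + 1)  # continue after this hit
    return hits


# Hypothetical toy transcript: 5 nt, the signal, then 40 nt of filler.
example = "GGCUC" + "AAUAAA" + "C" * 40
print(find_polya_signals(example))
```

A real analysis would also consider signal variants and downstream sequence elements; this sketch only shows the positional bookkeeping.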

As the stability of mRNA may be 
regulated in the cytoplasm, this may result 
in changes in its concentration. In fact, 
it has been found that estrogen not only 
induces transcription of the vitellogenin 
gene but also increases the stability of its 
mRNA in the cytoplasm, increasing its 
half-life from 16 h to 300 h. 
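
To give a feel for what such a change in half-life means, the sketch below assumes simple first-order decay and a constant rate of synthesis; both are modeling assumptions, not statements from the studies cited. Under these assumptions the steady-state mRNA level is proportional to the half-life, so an increase from 16 h to 300 h would correspond to a roughly 19-fold higher level at an unchanged transcription rate.

```python
import math


def decay_constant(half_life_h):
    """First-order decay constant (per hour) for a given half-life."""
    return math.log(2) / half_life_h


def steady_state_level(synthesis_rate, half_life_h):
    """Steady-state amount when synthesis is constant and decay is first order."""
    return synthesis_rate / decay_constant(half_life_h)


def fraction_remaining(t_h, half_life_h):
    """Fraction of an initial mRNA pool left after t_h hours without synthesis."""
    return 0.5 ** (t_h / half_life_h)


# Fold change in steady-state level for the half-lives quoted in the text.
fold = steady_state_level(1.0, 300) / steady_state_level(1.0, 16)
print(f"fold change in steady-state level: {fold:.2f}")
print(f"fraction left after 24 h (16 h half-life): {fraction_remaining(24, 16):.3f}")
print(f"fraction left after 24 h (300 h half-life): {fraction_remaining(24, 300):.3f}")
```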

Among the eukaryotes, all mRNAs
have been shown to possess a 5’ cap. Al- 
though the exact significance of capping is 
unclear, it is thought to serve as a recogni- 
tion point for the attachment of a ribosome 
at the outset of translation. This is con- 
sidered equivalent to the Shine-Dalgarno 
sequence (consensus AGGAGG), which is found in
prokaryotes and is the sequence to which 
the small subunit of the ribosome attaches 
in order to commence protein biosynthe- 
sis. The ribosome recognizes the cap struc- 
ture as its binding site and, after becoming 
attached, migrates along the mRNA until 
it reaches the initiation codon. Those mRNAs
that are translated on cytoplasmic
ribosomes have also been shown to be 
capped, but no capping has been identified 
on mitochondrial and chloroplast mRNAs. 

In eukaryotes, an mRNA after splicing
at the 5’ end has the structure
5’pppPuNp----3’, where Pu is a purine 
residue, N is the sugar component of 
the nucleotide, and p represents a phos- 
phate group. However, mature mRNA 
(after post-transcriptional changes) has at 
the 5’ end 5’-7-mGpppPuNp-----3’, where 
7mG (7-methyl guanosine) is attached af- 
ter transcription and is known as a cap. 
During capping, cleavage of the terminal 
phosphate group of the first nucleotide 
occurs, catalyzed by a phosphohydrolase. 
Subsequently, a guanylyl residue is trans- 
ferred to the 5’ end from GTP by the 
enzyme, guanylyl transferase, and there- 
after modified to a 7-methyl guanylyl 
residue by the enzyme mRNA guanylyl-7-methyltransferase
in the presence of
S-adenosyl methionine (SAM), which acts 
as a methyl donor. In this case, the newly 
added guanylyl residue is in the reverse 
orientation compared to other nucleotides 
present in the mRNA. A cap containing 
a single methyl group is known as cap 0; 
however, if there is an addition of another 
methyl group on the second nucleotide 
(which was in fact the first nucleotide in 
the original mRNA), this is referred to as 
cap 1 (though this occurs only if it is an 
adenine nucleotide). The methyl group is 
added on the N6 position of the adenine
nucleotide. 

In some cases, another methyl group 
may be added to the third nucleotide; 
the substrate for this reaction is cap 1 
mRNA, and the acceptor of the methyl 
group is the ribosyl moiety at the 2’ posi- 
tion, and this is referred to as cap 2. This 


reaction is catalyzed by the enzyme 2’-O- 
methyltransferase, while the methyl group 
donor (SAM) is unchanged. The number 
of caps is considered characteristic of the 
organism, while a low frequency of internal
methylation (1 in 1000 nucleotides)
is also known to occur in the mRNA of 
higher eukaryotes [1]. 

In prokaryotic mRNA, post-transcriptional
changes do not generally
occur. Rather, because of an absence of 
compartmentation there may be a coupled 
translation whereby, as soon as mRNA 
biosynthesis occurs (or is in progress), 
it may bind with the ribosome to begin 
translation. 


10 
Transport of Processed mRNA to the 
Cytoplasm 


Following any post-transcriptional modifi- 
cations, the matured mRNA is transported 
from the nucleus in a very rapid process 
and, on entering the cytoplasm, becomes 
bound to the cytoplasmic ribosomes in 
readiness for translation. The latter pro- 
cess occurs within only 1-5 min after the 
mRNA has left the nucleus. It has been 
suggested that specific proteins exist to 
assist the transportation of mature mRNA 
from the nucleus, though the exact process 
involved is not presently understood. 
Evidence indicates that the mRNA trans- 
port process is not restricted to the simple 
passage of mRNA through the nuclear 
pore complex, which spans the nuclear 
envelope; rather, it is embedded into the 
gene-expression pathway. During tran- 
scription, the message is capped, spliced, 
and polyadenylated, while the mRNA ex- 
port factors are loaded onto the nascent 
transcript. This maturation and assem- 
bly of the mRNA to form a messenger 


ribonucleoprotein particle (mRNP) is con- 
trolled by nuclear surveillance systems; 
the nuclear exosome and the Mlp1-2 system
combine to prevent the escape of
aberrant transcripts to the cytoplasm. As 
a consequence, only correctly assembled 
mRNPs are transported through the nu- 
clear pore to the cytoplasm by the mRNA 
export receptor Mex67-Mtr2/Tap-p15,
which is attached to the mRNA by interaction
with the mRNP-bound transcription
export (TREX) complex and serine/arginine-rich
(SR) proteins [71].


11 
Regulation of Gene Expression at the Level 
of Translation 


The majority of the regulation of gene 
expression takes place at the level of 
transcription. The production of mRNA 
involves many steps, several of which 
— such as promoter utilization, RNA 
splicing, and polyadenylation — are known 
to be regulated. Whilst pre-mRNA stability 
and the transport of mRNA from the nu- 
cleus to the cytoplasm provide a very rapid 
control of gene expression, on occasion the 
level of translation may be manipulated 
by changing the essential components 
of the translational machinery of the cell. 
In this regard, phosphorylation of the 
ribosomal components (particularly ribosomal
protein S6 in the 40S ribosomal subunit) has
been correlated with higher polysome 
levels in the presence of different growth 
factors in mammalian cells. A similar 
situation occurs in the brine shrimp egg 
where, upon fertilization, a previously 
absent translational initiation factor that is 
involved in polysome formation suddenly 
appears [1, 72]. 

In mammalian reticulocytes the con- 
trol of protein synthesis by hemin is 
mediated via the formation from a
presynthesized precursor (prorepressor) 
of a potent inhibitor of a polypeptide chain 
initiator, the hemin controlled repressor 
(HCR). Despite these cells having lost their 
nuclei, they retain high levels of stable 
mRNAs that encode mostly hemoglobin 
chains. In reticulocyte lysates, protein syn- 
thesis occurs at high rate but declines 
rapidly in the absence of hemin. Within 
the cells, hemin synthesis occurs in the 
mitochondria, but these are absent from 
the lysate. In fact, HCR is activated in the 
absence of heme, but inhibited in its pres- 
ence. Although the mode of action of HCR 
was a mystery for many years, it has now 
been shown to act as a specific kinase for 
phosphorylation of the α subunit of translation
initiation factor 2. Because of the
eukaryotic initiation factor 2 (eIF2)
GTP-GDP exchange cycle, the
phosphorylation of even a small fraction of
eIF2 is sufficient to block the initiation
of protein synthesis. Indeed, it appears that
all of the eIF2B (which is present in lesser
amounts than eIF2) is sequestered into
eIF2-eIF2B complexes, such that it is no
longer available to recycle the remaining
unphosphorylated eIF2 [1, 73].

A translational inhibitor, which is 
present in Friend leukemia cells and 
has been characterized as a heat-labile, 
sulfhydryl reagent-insensitive protein of 
molecular weight almost 214 kDa, inhibits 
the initiation of protein synthesis by 
preventing the initiation factor-dependent 
binding of methionyl-tRNA to the 40S 
ribosomal subunit. However, this does not 
interfere with the formation of a ternary 
complex between eIF2, methionyl-tRNA, 
and GTP. Rather, the inhibitor functions 
as a protein kinase which phosphorylates 
the α subunit of eIF2, and has been
considered analogous to the HCR of 
reticulocyte cells [1, 74]. 

A phosphoprotein phosphatase enzyme
capable of releasing the phosphate group
from the phosphorylated α subunit of eIF2
has been reported in rabbit reticulocytes;
this enzyme could restore the activity of eIF2
lost after phosphorylation. The activity
of this enzyme was stimulated almost
threefold by an optimal concentration of
Mn2+ ions, but not by Ca2+ or Mg2+
ions. In contrast, the enzyme activity was
greatly inhibited by Fe2+ ions and purine
nucleoside diphosphates [1, 75].

During post-translational modifications, 
many proteins are modified by pro- 
cesses such as phosphorylation, acetyla- 
tion, and hydroxylation at the side chains 
of the amino acids. In many proteins, 
there is also conjugation of nonprotein 
component(s). 

Recently, the post-translational regula- 
tion of transcription factors has been 
shown to play an important role in the 
control of gene expression in eukaryotes. 
The mechanisms of regulation include not 
only factor modifications, but also regu- 
lated protein-protein interaction, protein 
degradation, and intracellular partitioning. 
In plants, the basic-region leucine zipper 
(bZIP) transcription factors contribute to 
many transcriptional response pathways. 
It has been suggested that plant bZIP 
factors are under the control of various par- 
tially signal-induced and reversible post- 
translational mechanisms that are crucial 
for the control of their function. However, 
only a few plant bZIPs have yet been inves- 
tigated with respect to post-translational 
regulation [76]. 

Oct4 is a key component of the molecu- 
lar circuitry which regulates embryonic 
stem cell proliferation and differenti- 
ation. It is essential for the mainte- 
nance of undifferentiated, pluripotent cell 
populations, and binds with DNA in 
multiple heterodimeric and homodimeric 


configurations. At present, very little is 
known regarding the regulation of the 
formation of these complexes, and of 
the mechanisms by which Oct4 pro- 
teins respond to complex extracellular 
stimuli that regulate pluripotency. How- 
ever, a phosphorylation-based mechanism 
has been proposed for the regulation of 
specific Oct4 homodimer conformations, 
whereby the point mutations of a puta- 
tive phosphorylation site might specifically 
abrogate the transcriptional activity of a 
specific homodimer assembly, with min- 
imal effect on other configurations. It 
has also been shown that altering the 
Oct4 protein levels has an effect on the 
transcription of Oct4 target genes, with 
several signaling pathways having been 
identified that may mediate this phospho- 
rylation and act in combination to regulate 
Oct4 transcriptional activity and protein 
stability [77]. 

Other strategies that act either at or be- 
fore the translation initiation step include 
alterations to the inherent variability in the 
life span of eukaryotic mRNA, and mRNA 
stability in response to certain agents. 
Keywords


Coevolution 

Correlated evolutionary changes in the genetic composition of biological objects, such 
as genes. Typically, it is assumed that the change in one object is triggered by the genetic 
change in a second object (although within the context of this chapter the causality 
assumption is not made). 


Coexpression 

Analogous spatial or temporal expression patterns of two or more genes. Typically, 
coexpression can be identified by calculating correlation coefficients and assessing these 
correlations through statistical tests. 


Graph/network 

A representation of a binary relation between a set of objects. The objects are called 
vertices (or nodes). The relation between two vertices is represented by an edge (or link). 
Specifically, a graph of which the edges are directed is termed a directed graph. Directed 
graphs can be used to represent an asymmetric relation, such as gene regulation. 


Gene regulatory network 
A directed network of genes where one gene (a transcription factor) regulates the 
expression of another gene (target gene). 


Homologs and orthologs 

Homologs are genes which originated from a common ancestor. Two homologs found 
in different species are termed orthologs if their least common ancestor in the gene 
evolutionary tree corresponds to a speciation event. Orthologs typically perform closely 
related roles in the corresponding organisms. 


Protein—protein interaction network 
A network in which the nodes denote proteins and the links represent the binding
of proteins to carry out their biological function.


Interactome 


Transcription factor 

A protein which binds to specific DNA sequences to regulate the transcription or 
expression of genes. By controlling the access of RNA polymerase to the genes, 
transcription factors can either activate or repress the expression (the transfer of genetic 
information from DNA to mRNA) of the corresponding gene. 


Yeast two-hybrid (Y2H) 

A yeast-based biological technique used to test protein interactions. One of the tested
proteins is fused to a DNA-binding domain and the other to an activation domain. If
the two proteins interact, a reporter gene is expressed.


The interactome of a cell consists of all the molecular interactions that occur 
within that cell. Among the diverse types of molecular interaction, genome-wide 
protein-protein interactions are the most broadly studied. In this chapter, attention 
is focused on genome-scale methods to infer and to study protein-protein 
interactions. Following a brief introduction to the subject, the experimental 
and computational techniques used to uncover interactions between proteins are 
outlined. In the context of computational approaches, methods are also discussed 
for predicting interactions between protein domains. Later, the basic topological 
properties of protein interaction networks are described, such as node degree 
distribution, modularity, and network motifs. The motivation and approaches for 
comparing biological networks are then detailed, followed by a brief introduction 
to protein and domain interaction databases. Finally, some sample applications are 
provided that take advantage of interactome data, and details are proposed for the 
possible future direction of interactome-related research. 


1
Introduction


Within a cell, diverse biomolecules work
together in a coordinated fashion to provide
specific cellular functions. This coordinated
action is achieved, in large
part, by a variety of intermolecular interactions
including protein-protein interactions,
protein-DNA interactions, RNA
interactions, and many others. Here, attention
is focused mainly on the interactions
between proteins, including those corresponding
to physical interactions (here,
protein-protein binding) as well as to
more abstract “functional” interactions
between them. However, other interaction
types will also be briefly mentioned.

Biomolecular interactions are conveniently
represented as networks (graphs),
where the nodes (vertices) represent
molecules, and the links (edges) represent
the interactions between them. Depending
on the type of interaction, the corresponding
edge might be directed, or not. For
example, a binding of two proteins is typically
represented by an undirected edge,
while an interaction between a transcription
factor and a gene, the expression of
which is regulated by the given transcription
factor, is usually represented by a
directed edge. A representation of the interactome
as a network/graph not only
provides a convenient visualization but
also enables the use of graph theoretical
concepts and tools in the study of
biological networks. For example, the Cytoscape
suite has recently emerged as a
leading tool for visualization and network
analysis, and also allows for the development
of third-party tools (plugins) that
take advantage of its functionality [1]. An
example of a yeast protein-protein interaction
network, visualized with the help of
Cytoscape, is shown in Fig. 1.

Fig. 1 Yeast protein-protein interaction network drawn using
Cytoscape [1], constructed from complex purification
experiments [2, 3] with statistical scores from Ref. [4].
Modules (proteins that are simultaneously present in several
protein complexes) are color-coded.


2
Experimental Techniques for Detecting
Protein Interactions


The two main technologies that are used
in high-throughput protein-protein interaction
detection experiments are the
yeast two-hybrid (Y2H) assay, and protein
complex detection: co-immunoprecipitation
(co-IP) and coaffinity purification
followed by mass spectrometry (coIP/MS).
These two techniques are vastly different,
with each having its own strengths and
limitations.

In the Y2H experiment, as pioneered 
by Fields and Song [5], one of the tested 
proteins (the so-called “bait”) is fused
with a DNA-binding domain (usually
GAL4 or LexA), while the second protein
(the “prey”) is fused with a transcriptional
activation domain for a transcription factor
that can activate expression of a reporter
gene (e.g., β-galactosidase). Both chimeras
are then expressed in a yeast cell and, 
if they interact, their interaction prompts 
expression of the reporter gene. The 


two most important properties of this 
experimental technique to be borne in 
mind are: 


• It detects binary interactions only, and
  thus might miss interacting proteins
  that require additional proteins, such
  as scaffold proteins or other members
  of a protein complex, to facilitate the
  interaction.

• It uncovers the potential for interaction;
  that is, whether or not the two proteins
  actually interact in a cell depends upon
  spatial, temporal, and contextual constraints
  such as the cell cycle phase,
  stress, and the presence (or lack) of a
  particular nutrient.


Currently, high-throughput Y2H inter- 
action networks (maps) are available for 
many organisms, including Saccharomyces 
cerevisiae [6-9], Caenorhabditis elegans 
[10, 11], Drosophila melanogaster [12], and 
humans [13, 14]. 

The second key technology used to ob- 
tain high-throughput interaction maps is 
that of co-complex identification includ- 
ing tandem affinity purification (TAP) fol- 
lowed by mass spectrometry (TAP/MS) 
[15-17]. In contrast to the Y2H procedure, 
this approach reveals one-to-many inter- 
actions in a particular experimental condi- 
tion. Specifically, a bait protein is tested for 
interaction with all other proteins (preys) 
expressed in the given condition. This is 
achieved by allowing a complex formation 
of the bait protein with other proteins in 
the cell, and then retrieving and purifying 
the corresponding complexes and identify- 
ing any co-complexed proteins using MS. 
The retrieval of the corresponding com- 
plexes requires either antibodies for the 
bait protein, or tagging of the bait pro- 
tein with a peptide for which antibodies 
are available. Consequently, it should be
borne in mind that: 


• As the experiment recovers whole complexes,
  the bait protein does not necessarily
  interact directly with all proteins
  in the complex.

• Transient interactions are often difficult
  to capture using this approach.

As with the Y2H experiments, a number of
high-throughput co-complex interaction
datasets have been reported [2, 3, 18-20].


The two above-described experimental 
methods are, in many ways, complemen- 
tary and have unique strengths and limi- 
tations [21-23]. Hence, the results of both 
types of experiment are commonly com- 
bined into one protein-protein interaction 
network. While such a combined network 
provides a more complete interactome, it 
is important to bear in mind that, if treated 
individually, the two types of experiment 
used to obtain such a network define two 
rather different networks in terms of their 
biological and topological properties (e.g., 
Refs [9, 24]). 
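
One simple way to combine the two data types while keeping them distinguishable is to label each edge with the kind of experiment that supports it. The Python sketch below is a minimal illustration under stated assumptions: the function name, the evidence labels, and the toy protein names are hypothetical, and linking a bait to every co-purified prey is a simplification that, as noted above, does not imply direct physical contact.

```python
from collections import defaultdict


def combine_interactions(y2h_pairs, cocomplex_pulldowns):
    """Merge binary Y2H pairs and bait -> preys pull-downs into one edge map.

    Returns a dict mapping frozenset({a, b}) to the set of evidence labels
    ("Y2H", "co-complex") that support the undirected edge a-b.
    """
    edges = defaultdict(set)
    for a, b in y2h_pairs:
        edges[frozenset((a, b))].add("Y2H")
    for bait, preys in cocomplex_pulldowns.items():
        for prey in preys:
            if prey != bait:  # skip the bait recovering itself
                edges[frozenset((bait, prey))].add("co-complex")
    return dict(edges)


# Hypothetical toy data.
y2h = [("P1", "P2"), ("P2", "P3")]
pulldowns = {"P1": ["P2", "P4"]}

network = combine_interactions(y2h, pulldowns)
for edge, evidence in sorted(network.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(edge), sorted(evidence))
```

Keeping the evidence labels makes it possible to analyze the Y2H-only and co-complex-only subnetworks separately, which matters because the two differ in their biological and topological properties.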


3 
Computational Prediction of Protein 
Interactions 


Experimental procedures for detecting
protein-protein interactions are complemented
by computational approaches.
The latter methods incorporate a variety 
of techniques that can be subdivided 
into three categories: evolutionary- 
based approaches; statistical methods; 
and machine learning techniques. 
Evolutionary-based approaches for
predicting protein interactions typically 
explore the idea that interacting proteins 
are subject to common evolutionary 
constraints. Such constraints can impact 
the spatial organization of interacting
genes in the genome, the position in 
the protein-protein interaction network, 
and/or the amino-acid sequence. While 
some of these approaches are designed to 
predict physical interactions, many do not 
attempt to distinguish between physical 
and functional interactions, as both types 
of interaction might be subject to similar 
evolutionary constraints. 


3.1 
Interaction Prediction from the Gene 
Patterns Across Genomes 


3.1.1 Gene Fusion 

The gene fusion method is an evolution- 
based approach for predicting physical 
interactions. The main idea follows from 
the observation that if a pair of proteins 
(A and B) present in one organism are, 
in another organism, fused together into 
a single protein, then these two proteins 
are likely to interact (Fig. 2a) [25, 26]. A 
natural explanation for this observation is 
that if A and B interact, then bringing 
them together in the fused protein will 
facilitate their interaction. Marcotte et al. 
coined the term “Rosetta Stone protein”
for the fused protein and, by using this
approach, were able to identify 6809
such putative protein-protein interactions
in Escherichia coli, and 45502 in yeast
(Fig. 2a). In a larger study, Enright and
Ouzounis uncovered 7224 component and
2365 unique composite proteins across 24 
species [27]. 
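
The logic of the gene fusion method can be sketched in a few lines of Python. This is a toy illustration only, not the procedure used in the cited studies: it assumes that each protein has already been assigned to one or more protein families (for example by sequence similarity), and the function name, data layout, and example genomes are hypothetical.

```python
from itertools import combinations


def rosetta_stone_pairs(genomes):
    """Predict interacting family pairs from fusion events.

    genomes: {genome_name: {protein_name: set of family labels}}.
    A pair (a, b) is reported when a and b occur together in one protein
    in some genome and as separate single-family proteins in another genome.
    """
    separate = {}  # family -> genomes where it is a stand-alone protein
    fused = {}     # (fam_a, fam_b) -> genomes where both share one protein
    for genome, proteins in genomes.items():
        for families in proteins.values():
            if len(families) == 1:
                (fam,) = families
                separate.setdefault(fam, set()).add(genome)
            else:
                for pair in combinations(sorted(families), 2):
                    fused.setdefault(pair, set()).add(genome)

    predictions = set()
    for (a, b), fused_in in fused.items():
        both_separate = separate.get(a, set()) & separate.get(b, set())
        if both_separate - fused_in:  # separate somewhere other than the fusion genome
            predictions.add((a, b))
    return predictions


# Hypothetical toy genomes: A and B are separate in G1 but fused in G2.
toy = {
    "G1": {"p1": {"A"}, "p2": {"B"}, "p3": {"C"}},
    "G2": {"q1": {"A", "B"}, "q2": {"C"}},
}
print(rosetta_stone_pairs(toy))
```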


3.1.2 Gene Order 

The whole-genome sequencing of large 
numbers of genomes allows observations 
to be made of the patterns of genome or- 
ganization and evolution [28]. Within the 
context of bacterial and archaeal genomes 
it has been observed that, whilst in general 


the ordering of genes along genomes 
is not well conserved between species, 
the order of genes encoding interact- 
ing proteins does tend to be conserved 
[29, 30]. Conversely, proteins encoded by 
gene pairs with a conserved gene order 
often interact physically. Based on these 
observations, Dandekar et al. proposed a 
method for predicting interacting proteins 
[30] as being proteins encoded by genes 
with a conserved gene order (Fig. 2b). 
It should be noted that this so-called 
“gene order method” and its variants 
[31-34] are most suitable in the context 
of bacterial and archaeal genomes, where 
groups of genes are organized into operon 
structures. 
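
As a minimal sketch of the gene order idea, the following Python code counts how often two genes are immediate neighbors across a set of genomes and reports the pairs whose adjacency is conserved. It assumes that genes have already been mapped to shared ortholog labels and ignores strand and orientation; the function names, the threshold parameter, and the toy genomes are hypothetical and do not reproduce the method of Dandekar et al. in detail.

```python
def adjacent_pairs(gene_order):
    """Unordered pairs of immediately neighboring genes in one genome."""
    return {frozenset(pair) for pair in zip(gene_order, gene_order[1:])}


def conserved_neighbors(genomes, min_genomes=2):
    """Gene pairs that are adjacent in at least min_genomes genomes.

    genomes: {genome_name: list of ortholog labels in chromosomal order}.
    """
    counts = {}
    for order in genomes.values():
        for pair in adjacent_pairs(order):
            counts[pair] = counts.get(pair, 0) + 1
    return {pair for pair, n in counts.items() if n >= min_genomes}


# Hypothetical toy genomes: only a and b stay adjacent in all three.
toy = {
    "G1": ["a", "b", "c", "d"],
    "G2": ["c", "a", "b", "d"],
    "G3": ["d", "b", "a", "c"],
}
print(conserved_neighbors(toy, min_genomes=3))
```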


3.1.3 Phylogenetic Profiling

The phylogenic profiling method is based 
on the premise that functional interactions 
are conserved across a range of species. 
Consistent with this assumption, there 
should be a correlation between patterns 
of the presence and absence in various 
genomes of functionally interacting 
genes. Such a presence/absence pattern is 
referred to as the phylogenic profile of a gene 
[35-38]. More formally, the phylogenic 
profile of a gene within a set of n reference 
genomes is a vector of length n, where the 
i-th element of the vector is set to 1 if the 
given gene is present in the i-th genome, 
and to 0 otherwise. Similarities between 
phylogenic profiles can be measured 
using metrics such as Hamming distance, 
correlation coefficient, or mutual infor- 
mation. The presence/absence of a gene 
within a genome can be also quantified 
using probability scores instead of binary 
values. Similarities between phylogenetic 
profiles can be used to predict functional 
linkage between proteins (Fig. 2c) [35, 
36]. The grouping of genes with common 
evolutionary patterns also allows not only 
the prediction of functional associations
for genes with unknown functions, but
also the discovery of any previously
uncharacterized cellular pathways and
functional network modules [39-54].
Moreover, it has been shown that the
reliability of the phylogenic profile method
depends on the selection of the reference
genomes [53, 55].

Bowers et al. extended the phylogenic
profiling method so that it could
take into consideration three proteins
simultaneously. In this case, instances
were sought in which the combined logical
patterns embodied by two proteins determined
the behavior of a third protein [56].

Fig. 2 Prediction of protein interaction from
gene pattern in the genomes. (a) The “Rosetta
stone” method that predicts interaction between
proteins P1 and P2, based on the fact
that they are fused together into one protein
in genome 3; (b) The gene order method
predicts interaction between P1 and P3, based
on the conservation of their order; (c) The
phylogenic profile method predicts (functional)
interaction, based on similar patterns of presence
and absence in genomes G1-G5.


Whilst the phylogenic profile predicts 
functional associations rather than phys- 
ical interactions, an important drawback 
of the method is that it cannot be used to 
predict interactions between proteins that 
are present in all (or nearly all) reference 
genomes. 
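
The profile comparison described in this section can be illustrated with a short Python sketch that implements the three similarity measures named above (Hamming distance, correlation coefficient, and mutual information) for binary presence/absence vectors. The function names and the toy profiles are hypothetical. The last profile, a gene present in every reference genome, shows the drawback just mentioned: its profile carries no information, so the correlation is undefined and the mutual information is zero.

```python
import math


def hamming(p, q):
    """Number of reference genomes in which the two profiles differ."""
    return sum(a != b for a, b in zip(p, q))


def pearson(p, q):
    """Pearson correlation; NaN if either profile is constant."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    vp = sum((a - mp) ** 2 for a in p)
    vq = sum((b - mq) ** 2 for b in q)
    if vp == 0 or vq == 0:
        return float("nan")  # gene present in all (or no) genomes
    return cov / math.sqrt(vp * vq)


def mutual_information(p, q):
    """Mutual information (in bits) between two binary profiles."""
    n = len(p)
    mi = 0.0
    for x in (0, 1):
        for y in (0, 1):
            pxy = sum(1 for a, b in zip(p, q) if a == x and b == y) / n
            px = sum(1 for a in p if a == x) / n
            py = sum(1 for b in q if b == y) / n
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px * py))
    return mi


# Hypothetical profiles over five reference genomes (1 = present).
gene_a = [1, 0, 1, 1, 0]
gene_b = [1, 0, 1, 1, 0]   # identical pattern to gene_a
gene_c = [1, 1, 1, 1, 1]   # present everywhere: uninformative

print(hamming(gene_a, gene_b), pearson(gene_a, gene_b), mutual_information(gene_a, gene_b))
print(hamming(gene_a, gene_c), pearson(gene_a, gene_c), mutual_information(gene_a, gene_c))
```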


3.2
Predicting Interaction from Sequence 
Coevolution 


Whereas the approaches described so far
utilize patterns of gene organization within
genomes, the “mirror tree” method
(as discussed here) takes the next step
and zooms in on the details of sequence
features. This approach is based on the 
premise that sequences of interacting pro- 
teins are expected to coevolve to maintain 
their interactions [57-60]. Motivated by 
this supposition, the mirror tree method 
predicts protein-protein interactions by 
assessing the extent of agreement between 
evolutionary trees or, more precisely, be- 
tween the distance matrices used to in- 
fer such trees, as illustrated in Fig. 3 
[57, 59-67]. 

In its simplest form, the mirror tree 
method assesses the coevolution of two 
proteins by the correlation between the 
distance matrices constructed individually 
for the sets of sequences orthologous to 
each of the two proteins. That is, given 
two proteins A and B, it considers n se- 
quences orthologous with A and the same 
number of sequences orthologous with B 


Protein A Protein B 


| 


\7— 


Fig. 3. The basic variant of the mirror tree 
method. Coevolution between proteins A and 
B is assessed by comparing the evolutionary 
rates of a family of sequences orthologous to 
A, and the corresponding family of sequences 


a:| 


deriving from the same set of species. Sub- 
sequently, for each pair (i,j) of orthologs of 
A (respectively B), the method estimates 
the evolutionary distance Aj, (respectively 
B,,) between them. The degree of coevolu- 
tion between the two families of orthologs 
is then assessed by computing the cor- 
relation coefficient between the distance 
matrices (see Fig. 3). 

As described above, the mirror tree 
method measures the correlation between 
rates of evolutionary change (rates of di- 
vergence). Two reasons have been iden- 
tified for which such correlation might 
occur: (i) a common speciation history; 
and (ii) common evolutionary constraints 
imposed by physical and/or functional 
interactions. Thus, one of the main chal- 
lenges related to the mirror tree method 
is to separate these two sources of correla- 
tion, and several techniques have recently 
been developed to address this problem, 


Ortholog sequences 
from a common set of species 


Matrices of pairwise 
evolutionary distances 


Correlation between Aj and Bj 


orthologous to B. Subsequent variants of the 
method attempt to account for the common 
speciation history (illustrated by the evolution- 
ary tree on the right) of both families. 


with marked improvements in accuracy 
when predicting interactions [64, 65, 68]. 
In the methods developed by Pazos et al. 
and Sato et al., the estimated organism di- 
vergence rates were first subtracted from 
the combined coevolution signal. Subse- 
quently, Kann et al. further improved the 
mirror tree performance by restricting the 
coevolution analysis to the more conserved 
regions in the protein domain sequences 
while disregarding any highly divergent 
regions. The latter regions are likely to have
diverged through neutral evolution, and would
not be expected to contain a functional 
coevolution signal. 
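
The basic mirror tree score described above can be sketched in a few lines. This is a minimal illustration, assuming that the two distance matrices are already available and that row and column k of both matrices refer to the same species; the function names and the toy distances are hypothetical, and no correction for the shared speciation history is attempted.

```python
import math

def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def mirror_tree_score(dist_a, dist_b):
    """Correlation between the distance matrices of two ortholog families."""
    if len(dist_a) != len(dist_b):
        raise ValueError("both families need orthologs from the same species")
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))

# Hypothetical pairwise distances among orthologs of A from four species.
dist_a = [[0.0, 0.2, 0.5, 0.9],
          [0.2, 0.0, 0.4, 0.8],
          [0.5, 0.4, 0.0, 0.6],
          [0.9, 0.8, 0.6, 0.0]]
# Family B diverges twice as fast but with the same pattern.
dist_b = [[2 * d for d in row] for row in dist_a]

score = mirror_tree_score(dist_a, dist_b)
```

Because only the upper triangle is used, each pair of species contributes once. In this toy case the two families are perfectly correlated, which, as noted above, may reflect nothing more than a shared speciation history.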

It is natural to hypothesize that the 
driving mechanisms of sequence coevolu- 
tion derive from compensating mutations, 
where a mutation in one binding part- 
ner is compensated by a complementary 
mutation in another partner, so as to main- 
tain amino acid interactions. Yet, it has 
been shown that compensating mutations 
are not the only — perhaps not even the 
dominant — contributors to the correla- 
tion of evolutionary rates [69, 70]. Rather, 
the coevolution signal is more likely a 
composite of many other factors, such as
the coexpression of interacting proteins, 
similar codon usage, and interaction with 
other proteins in a complex. This suggests 
that it might be more meaningful to con- 
sider the coevolution (or coadaptation) of 
proteins within a broader, network-level 
context. Subsequently, this proposal was 
explored by Juan et al., who replaced the 
vector of evolutionary distances by the vec- 
tor of coevolutionary correlations between 
all proteins within a genome and, in this 
way, obtained a reliable interaction net- 
work of E. coli [71]. 

Currently, several other variants of the 
mirror tree method are available. For 
example, Tillier and Charlebois replaced
a simple matrix correlation by a more
sophisticated search for the most similar
common subtree that can also include
paralogs [72]. Another variant of the mirror 
tree approach has also been used to predict 
interaction specificity. In this case, given 
two families of proteins which are known 
to interact, the objective is to establish a 
mapping by defining interaction partners 
between the members of one family and 
members of the other family [62, 73, 74]. 
Finally, Jothi et al. used the mirror tree 
method to identify interacting domains 
within interacting proteins [61]. 


3.3 
Domain Interactions 


A large proportion of prokaryotic pro- 
teins, as well as most eukaryotic proteins, 
are composed of more than one domain 
[75]. Protein interaction typically involves 
binding between two or more specific do- 
mains; indeed, the domain composition 
of two proteins can be used to predict 
interactions between them [76-84]. Con- 
versely, knowledge of a protein interaction 
network can be used to infer interacting 
domains [85-90]. 

The concept of using protein-protein
interaction networks to predict
domain-domain interactions was first 
explored by Sprinzak and Margalit, who 
proposed a simple statistical approach 
that was referred to as the Association 
Method [82]. The idea here was to score 
each domain pair by the log ratio of the 
frequency of occurrences in interacting 
proteins to the expected frequency of 
independent occurrences of the two 
domains [82]. That is, if $P_i$ is the observed
frequency of domain i in the interaction
network, and $P_{ij}$ is the observed frequency
of domain pair (i, j) in the interacting
protein pairs, then the Association Score is

$$AS_{ij} = \log \frac{P_{ij}}{P_i P_j}.$$

The interacting domain pairs are then predicted as those with the
highest AS value.
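
The Association Score can be computed directly from an interaction list and a domain annotation. The sketch below is illustrative only: the way the frequencies are normalized (domain frequencies over the proteins in the network, pair frequencies over the interacting protein pairs) is an assumption, and the protein and domain names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def association_scores(interactions, domains):
    """Score domain pairs by log(P_ij / (P_i * P_j)).

    interactions: list of (protein_a, protein_b) pairs
    domains: dict mapping protein -> set of domain names
    """
    proteins = {p for pair in interactions for p in pair}
    # P_i: fraction of network proteins that contain domain i.
    domain_counts = Counter(d for p in proteins for d in domains[p])
    p_single = {d: c / len(proteins) for d, c in domain_counts.items()}

    # P_ij: fraction of interacting protein pairs that contain the pair (i, j).
    pair_counts = Counter()
    for a, b in interactions:
        seen = {tuple(sorted((i, j))) for i, j in product(domains[a], domains[b])}
        pair_counts.update(seen)

    scores = {}
    for (i, j), count in pair_counts.items():
        p_pair = count / len(interactions)
        scores[(i, j)] = math.log(p_pair / (p_single[i] * p_single[j]))
    return scores

# Hypothetical toy network.
domains = {"P1": {"SH3"}, "P2": {"PRM"}, "P3": {"SH3", "KIN"},
           "P4": {"PRM"}, "P5": {"KIN"}, "P6": {"ZNF"}}
interactions = [("P1", "P2"), ("P3", "P4"), ("P5", "P6")]

scores = association_scores(interactions, domains)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy network the pair (PRM, SH3) occurs in two of the three interacting protein pairs and therefore scores higher than (KIN, PRM), which occurs only once alongside it.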

Following investigations conducted by 
Sprinzak and Margalit, several related 
methods have been proposed (for a re- 
view, see Ref. [86]). For example, Deng 
et al. developed a maximum likelihood 
approach to estimate the probability 
of domain-domain interactions [81], in
which the idea was to estimate, for each do- 
main pair, the probability of an interaction 
between domains, so that the likelihood 
of the interaction network would be maxi- 
mized. An elegant feature of this approach 
was the explicit modeling of errors in the 
high-throughput data that constituted the 
protein interaction network. 
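
One way to write such a likelihood is sketched below. The notation is an assumption introduced here for illustration and is not taken from Ref. [81]: $\lambda_{ij}$ is the probability that domains i and j interact, domain pairs are assumed to act independently, $P_{AB}$ indicates a true interaction between proteins A and B, $O_{AB}$ indicates an observed one with outcome $o_{AB} \in \{0, 1\}$, and $fp$ and $fn$ are assumed false-positive and false-negative rates of the experimental data.

```latex
\begin{align}
\Pr(P_{AB} = 1) &= 1 - \prod_{i \in A,\; j \in B} \left(1 - \lambda_{ij}\right), \\
\Pr(O_{AB} = 1) &= \Pr(P_{AB} = 1)\,(1 - fn) + \bigl(1 - \Pr(P_{AB} = 1)\bigr)\, fp, \\
L(\lambda) &= \prod_{(A,B)} \Pr(O_{AB} = 1)^{\,o_{AB}} \, \bigl(1 - \Pr(O_{AB} = 1)\bigr)^{\,1 - o_{AB}} .
\end{align}
```

Under these assumptions, two proteins interact if at least one of their domain pairs interacts, and the values $\lambda_{ij}$ are chosen to maximize $L$ over all protein pairs.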

Whilst these early domain interaction 
prediction methods were successful in 
identifying domains that would interact 
in a constitutive manner, they were chal- 
lenged in situations where a domain pair 
(i, j) interacted in the context of some protein
pairs but, at the same time, there were
also many proteins that contained i and
j, respectively, but did not interact. In an 
attempt to discover such context-specific 
interactions, Riley et al. introduced the 
so-called domain pair exclusion analysis 
(DPEA) [91], which was a clever utiliza- 
tion of the maximum likelihood approach. 
Notably, the likelihood score of a net- 
work can be viewed as a measure of how 
well the probabilities assigned to puta- 
tive domain interactions could explain the 
network. Thus, if the domain pair (i, j) 
were to mediate protein-protein interactions
in a context-specific way, then the 
exclusion of such a domain pair as a pos- 
sible interacting pair should reduce the 
likelihood score of the network. Consis- 
tent with this premise, DPEA can be used 
to predict interactions between domains by 
monitoring the fall in the likelihood score
when a particular domain pair was not allowed
to interact. Moreover, this approach 
could be used to detect interacting domain 
pairs that had been missed by previous 
approaches. More recently, Wang et al. 
[90] improved this concept further by sug- 
gesting the use of a scoring method that 
would account more fully for the context in 
which the interaction occurred. In order to 
achieve this, rather than globally disallow- 
ing all interactions between two specific 
domains, as had been proposed by Riley 
et al., only interactions in the context of a 
specific pair of interacting proteins were 
disallowed. 

The idea of recovering interacting do- 
mains by examining how well the po- 
tential domain contacts could explain the 
protein interaction network, also formed 
the basis of a method proposed by
Guimarães and colleagues [85]. Building
on the assumption that protein interac- 
tions had evolved in a most parsimo- 
nious way, these authors proposed the 
Parsimonious explanation method, which 
identified the smallest weighted set of 
domain interactions that could “explain” 
the protein interaction network. In other 
words, the method located the smallest
set of domain pairs such that, if they
were considered to interact, then each in- 
teracting protein pair would contain at 
least one interacting domain pair. This 
model was formalized as an optimiza- 
tion problem, and solved with a linear 
programming procedure. In this case, 
the variables of the linear program rep- 
resented the potential domain contacts 
derived from the protein interaction net- 
work, while the constraints were defined as 
protein-protein interactions (edges). The 
construction is illustrated schematically in 
Fig. 4. 
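
The covering idea behind the parsimonious explanation can be illustrated on a toy network. The sketch below is not the linear program described above: it swaps in an exhaustive search over the unweighted, 0/1 version of the problem, which is feasible only for very small examples. All names and the toy data are hypothetical.

```python
from itertools import combinations, product

def candidate_pairs(edge, domains):
    """All domain pairs (i, j) that could mediate one protein interaction."""
    a, b = edge
    return {frozenset((i, j)) for i, j in product(domains[a], domains[b])}

def smallest_explaining_set(interactions, domains):
    """Exhaustively find a smallest set of domain pairs covering every edge."""
    per_edge = [candidate_pairs(edge, domains) for edge in interactions]
    universe = sorted(set().union(*per_edge), key=sorted)
    for size in range(1, len(universe) + 1):
        for chosen in combinations(universe, size):
            chosen_set = set(chosen)
            # Each interacting protein pair needs at least one chosen domain pair.
            if all(chosen_set & options for options in per_edge):
                return chosen_set
    return set()

# Hypothetical toy network: proteins A-D with domains d1-d4.
domains = {"A": {"d1", "d2"}, "B": {"d3"}, "C": {"d1"}, "D": {"d3", "d4"}}
interactions = [("A", "B"), ("C", "D"), ("C", "B")]

explanation = smallest_explaining_set(interactions, domains)
```

Here a single domain pair, (d1, d3), is present in every interacting protein pair and therefore explains the whole toy network. The linear programming relaxation replaces the 0/1 choice with real values between 0 and 1, which is what makes the approach practical at network scale.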

The methods discussed so far for predicting
interacting domains have utilized only
protein-protein interaction networks and the
domain composition of each protein. Yet, 
this basic information can be enriched 
in many ways [90, 92]. For example, by 
analogy to the Rosetta Stone approach 
(see Sect. 3.1.1), it has been observed that 
two domains that co-occur in one protein 
chain are also likely to mediate interactions 
between different proteins. The incorpora- 
tion of such additional information has the 
potential of improving domain interaction 
predictions [93, 94]. 

One formidable challenge that is faced 
by the methods used to predict inter- 
acting domains is an evaluation of the 
method’s quality. Typically, such an eval- 
uation would be based on a knowledge 
of the domain interactions derived from 
crystal structures, although unfortunately 
these data would be biased towards par- 
ticular types of interaction, thus limiting 
the ability to test the methods to the 
full [94]. 


Fig. 4. Domain-domain interaction inference
from protein interaction networks via a linear
programming construction. In this toy network,
the rectangles correspond to proteins
and the colored squares to domains. For each
domain pair (i, j) there is a variable $x_{ij}$ taking real
values between 0 and 1. For each pair (A, B)
of interacting proteins A and B, there is one
constraint $\sum_{i \in A,\, j \in B} x_{ij} \geq 1$, where the summation
is over all domain pairs (i, j), where i is a domain
of protein A and j is a domain of protein
B. Thus, the constraint for each interacting
protein pair enforces that the values of the
variables representing the potentially interacting
domain pairs add up to at least 1.0.
Following the parsimony principle, the objective
function aims to minimize the overall sum
of the variables.
[Figure labels: Domain i, Domain j.]


3.4
Coexpression Networks


Starting from the assumption that proteins
from the same protein complex are likely
to be coexpressed, expression data have
been utilized to predict new and to validate
previously known protein-protein interactions
[95-101]. Coexpression networks are
usually inferred by computing the Pearson
correlation coefficients or mutual information
between every pair of gene expression
profiles, across a variety of experimental
conditions. It has been shown that genes
with similar expression patterns across
a set of samples tend to be functionally
related [102]. Consequently, the coexpression
data are frequently combined with
other types of data in order to predict
protein-protein interactions and to create
functional networks. Such networks,
built using expression information, have
been constructed for a variety of organisms
[95-97, 103-107].

Expression profiles are often used to infer
regulatory networks. The underlying
assumption explored in gene regulatory
network reconstruction programs, such as
ARACNE [108], is that any change in the
expression of a transcription factor should
be mirrored by a change in the expression
of the genes regulated by that transcription
factor. Coexpression alone does not provide
information on the direction of a regulatory
relationship; however, expression
data can also be used to construct Bayesian
networks that can represent the conditional
dependence of expression levels (for
a primer on Bayesian network
analysis utilizing expression data, see Ref.
[109]; for a recent review, see Ref. [110]).
A related approach to orient the edges was
proposed by Schadt et al. [111, 112].


4 
Exploring the Topology of the Interactome 


Graph theory provides a unifying lan- 
guage to describe the relations within 
complex systems, and during recent years 
has played an increasingly important 
role in understanding biological systems. 
Representing the interactome as a graph enables
the use of graph-theoretical tools and concepts
to interrogate the properties of interaction networks.
Currently, several packages are available 
for visualizing, modeling, and analyzing 
various types of network, including the 
popular Cytoscape package [1, 113]. 


4.1 
Global Properties 


Studies of protein interaction networks 
from several model organisms have re- 
vealed that such networks often have in- 
teresting topological properties. One of 
the most celebrated such properties is
a particular distribution of node degrees
(the number of immediate network neighbors
for a given node). Specifically, it has
been argued that for these networks the
degree distribution is consistent with a
power law [114-116]; in other words, it has
been suggested that interaction networks
are scale-free. Formally, in a scale-free network
the fraction P(k) of nodes in the
network having degree k is proportional
to $k^{-\gamma}$; that is, $P(k) \sim k^{-\gamma}$, where $\gamma$ is a
constant the value of which is typically in
the range $2 < \gamma < 3$. While the accuracy
of the supposition that protein interaction
networks are scale-free has been
questioned [117-119], the main property
of protein interaction networks consistent
with a scale-free network, namely a small number
of highly connected nodes (hubs) and a
large number of weakly connected nodes,
is generally not disputed.
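
The empirical degree distribution P(k) is straightforward to tabulate from an edge list. The sketch below assumes an undirected network without self-loops or duplicate edges; the function name and toy edges are hypothetical, and estimating the exponent of a power law from such a table requires additional statistical care that is not shown here.

```python
from collections import Counter

def degree_distribution(edges):
    """Return {k: P(k)} for an undirected network given as an edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())
    n = len(degree)
    return {k: c / n for k, c in sorted(counts.items())}

# Hypothetical toy network: one hub "h" plus a single extra edge a-b.
edges = [("h", "a"), ("h", "b"), ("h", "c"), ("h", "d"), ("h", "e"), ("a", "b")]
dist = degree_distribution(edges)
```

Even in this tiny example the pattern described above is visible: half of the nodes have degree 1, while a single node has degree 5.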


4.2 
Network Centrality and Protein Essentiality 


Network centrality is a measure of the 
topological prominence of a node within 
a network, and a natural question is whether
network centrality is related to protein
function. Several approaches exist for the 
measurement of node centrality, empha- 
sizing the different aspects of network 
topology [120]. In the context of biologi- 
cal networks, the most studied centrality 
indices are degree centrality and betweenness
centrality, although many other indices
have also been considered [24, 121, 122].
Degree centrality evaluates the node’s cen- 
trality by the number of its immediate 
neighbors in the network (Fig. 5), with 
nodes having a high degree being referred 
to as hubs. In contrast, in shortest-path
betweenness centrality, the node’s
centrality value is proportional to the fraction
of shortest paths between all pairs
of nodes that pass through a given node
(Fig. 5). Thus, betweenness centrality measures
whether a node might be central for
the information flow within the network.
Alternatively, by modeling the interaction
network as an electric circuit, the node’s centrality
can be measured by the amount of current
passing through it. Specifically, for any
pair of nodes the current in the network
can be computed under the assumption
that a given pair of nodes is taken as the
source and the sink. The current flow centrality
for an individual node is the sum
of currents that pass through this node
over all source-sink pairs [24, 123-127].

In an influential report, Jeong et al. [114]
showed that, in a protein interaction
network of S. cerevisiae,
high-degree nodes contained more essential
proteins than would be expected by
chance. (In that study, a gene was termed
“essential” if its deletion prevented yeast
growth under optimal laboratory conditions.)
Similarly, nodes with high betweenness
centrality (bottlenecks) have been
correlated with gene essentiality [128].
The correlation between vertex degree and
essentiality has been studied extensively
[24, 129-131]; interestingly, however, this
relation has not been clearly observed in
Y2H networks [9, 24, 130].

Fig. 5 Illustration of centrality measures.
Vertices a, a’, b, and b’ tie for the highest
degree centrality of 4. The orange vertices
c and c’ have the highest betweenness
centrality, since every shortest path between
the green component and the blue
component must pass through these
vertices.
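
Degree centrality and shortest-path betweenness centrality can both be computed with breadth-first search. The following is a minimal sketch for small, unweighted, undirected networks, using the accumulation scheme usually attributed to Brandes; the function names and the toy network (two triangles joined through a bridge node "d") are hypothetical, and the scores are left unnormalized.

```python
from collections import deque

def build_adjacency(edges):
    """Adjacency sets for an undirected network given as an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def degree_centrality(adj):
    """Number of immediate neighbors of every node."""
    return {node: len(neighbors) for node, neighbors in adj.items()}

def betweenness_centrality(adj):
    """Unnormalized shortest-path betweenness of every node."""
    score = {node: 0.0 for node in adj}
    for source in adj:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        preds = {node: [] for node in adj}
        sigma = {node: 0 for node in adj}
        dist = {node: -1 for node in adj}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, visiting the farthest nodes first.
        delta = {node: 0.0 for node in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    # In an undirected network every node pair was counted twice.
    return {node: value / 2.0 for node, value in score.items()}

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("e", "g"), ("f", "g")]
adj = build_adjacency(edges)
degree = degree_centrality(adj)
betweenness = betweenness_centrality(adj)
```

In this toy network the bridge node "d" has only two neighbors but lies on every shortest path between the two triangles, so it has the highest betweenness while "c" and "e" have the highest degree. This is the same contrast between the two measures that Fig. 5 illustrates.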


It has been proposed that the enrichment
of essential proteins among high-degree
nodes is a consequence of the central
role that hubs play in mediating interactions
among other proteins. Indeed, the
removal of hubs disrupts the connectivity
of the network (as indicated by the network
diameter or the size of the largest
connected component) to a greater degree
than does the removal of an equivalent
number of random nodes [114, 132, 133].
However, the lack of any clear relation
between essentiality and vertex degree in
the Y2H-derived networks, as well as computational
evidence [24], suggests that the
correlation between vertex degree and essentiality
is related to an enrichment of essential
proteins in large dense subnetworks, typically
corresponding to complexes. This observation
was further confirmed by Wang
et al., who also showed that the enrichment
of complexes with essential proteins
increased with complex size [134].
In contrast, in the yeast binary (Y2H) interaction
network, Yu et al. identified a relationship between
the essentiality of a protein and the
number of cellular processes in which it
participates.
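
The effect of node removal on connectivity can be checked by measuring the largest connected component before and after deletion. The sketch below is illustrative only, with a hypothetical function name and toy adjacency structure; a real analysis would compare hub removal against many random removals of the same size.

```python
from collections import deque

def largest_component_size(adj, removed=()):
    """Size of the largest connected component after deleting some nodes."""
    removed = set(removed)
    seen = set(removed)  # deleted nodes are never visited
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            v = queue.popleft()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best

# Hypothetical toy network with one hub "h".
adj = {
    "h": {"a", "b", "c", "d"},
    "a": {"h", "b"},
    "b": {"h", "a"},
    "c": {"h"},
    "d": {"h"},
}
intact = largest_component_size(adj)
after_hub = largest_component_size(adj, removed=["h"])   # hub deleted
after_leaf = largest_component_size(adj, removed=["c"])  # low-degree node deleted
```

Deleting the hub fragments the toy network, whereas deleting a low-degree node leaves the remaining nodes connected.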

By combining vertex degree information 
with expression data, Han et al. were 
able to distinguish two groups of hubs
that they referred to as “party” hubs
and “date” hubs [135]. In the case of a
party hub, the expression of the hub node
is correlated with that of its neighbors,
which suggests that all of the interactions
may take place simultaneously, or under
similar conditions. In contrast, in the case
of a date hub, the correlation between the
expression of the hub and that of its neighbors
is, on average, low. Thus, these two 
types of hub are proposed to play distinct 
roles in the interactome, with party hubs 
being members of protein complexes 
or functional modules, and date hubs 
corresponding to global regulators that 
possibly link various functional modules. 


4.3 
Network Modules 


In their landmark report, Hartwell et al. 
proposed that functional modules served 
as a critical level of biological organization 
[136]. They defined a functional module
as an entity, composed of many types
of interacting molecule, the functions of 
which were separable from those of other 
modules. Currently, the modularity of 
biological systems is a widely accepted 
phenomenon and, indeed, by analyzing 
an early yeast protein—protein interaction 
network, Schwikowski et al. observed that 
proteins of known function and cellular 
location tended to cluster together. Subsequently,
the genome-scale reconstruction
of biological networks that was enabled
by current technologies provided the context
for identifying such modules. There 
is, however, no unique way in which 
the functional modules could be mathe- 
matically defined. Computationally, most 
methods will seek densely connected sub- 
graphs or clusters by using a variety of 
heuristics ranging from growing modules 
from seed clusters, to clustering based
on graph-theoretical distance measures, to 
Markov Chain Monte Carlo (MCMC) clus- 
tering approaches [137-154]. Additionally, 
gene expression information can be uti- 
lized to obtain more reliable modules 
[155-157]. 

One characteristic property that modules
would naturally be expected to satisfy
is that the molecules within a module
are more strongly connected among
themselves than they are to molecules
outside the module. This intuition can
be formalized by the following concept of
modularity. Given a partition of the nodes of a
network with m edges (links) into groups
$C_1, C_2, \ldots, C_n$, the modularity of such a
partition can be defined as:

$$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) c_{ij}$$

where $A_{ij}$ equals 1 if i and j are connected
in the network, and zero otherwise; $k_i$ is the
degree of node i; and $c_{ij}$
is equal to 1 if i and j are in the same group,
and zero otherwise. Thus defined, Q
takes values between -1 and 1, where
a positive value of Q indicates that the
number of edges within groups is higher
than would be expected by chance. Following
such a (or a related) definition, some
methods have identified modules as partitions
into “communities” that maximize
modularity [150, 158].
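
The modularity of a given partition can be evaluated directly from this definition. The sketch below assumes an undirected network without self-loops or duplicate edges and uses the degree-based expected term of the standard formulation; the function name and toy network are hypothetical.

```python
def modularity(edges, groups):
    """Q = (1/2m) * sum_ij (A_ij - k_i k_j / 2m) c_ij for an undirected network.

    edges: list of (u, v) pairs without duplicates or self-loops
    groups: dict mapping node -> group label
    """
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Sum of A_ij over ordered same-group pairs: each internal edge counts twice.
    internal = 2 * sum(1 for u, v in edges if groups[u] == groups[v])
    # Sum of k_i * k_j over ordered same-group pairs is (group degree sum)^2.
    group_degree = {}
    for node, k in degree.items():
        group_degree[groups[node]] = group_degree.get(groups[node], 0) + k
    expected = sum(total ** 2 for total in group_degree.values()) / (2 * m)
    return (internal - expected) / (2 * m)

# Hypothetical toy network: two triangles joined by the edge c-d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "f")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in "abcdef"}

q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
```

Splitting the toy network into its two triangles gives a positive Q (5/14), whereas placing all nodes in a single group gives Q = 0, since that partition contains exactly as many internal edges as expected.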

Most module-finding algorithms assign 
each node to at most one module although, 
in practice, biological modules can overlap 
and/or form modular hierarchies. Further- 
more, a given component may belong to 
a different module at a different time. 
Thus, some of the more recently developed 
approaches have focused on identifying 
overlapping modules [139, 159], their hier- 
archy [160], or their dynamics along activity 
pathways [161]. 


Given the wealth and diversity of the 
module-finding algorithms, it is impor- 
tant to provide some means of establishing 
the biological relevance of uncovered mod- 
ules. The most commonly applied strategy 
is to evaluate how well the various meth- 
ods perform in uncovering either known 
complexes [162] or potentially overlap- 
ping functional modules [163], and/or how 
well they are conserved through evolution 
[164]. In a recent evaluation of module- 
finding algorithms, Song and Singh noted 
that the performances of various algo- 
rithms in uncovering functional modules 
could differ substantially when run on 
the same network. Moreover, their rela- 
tive performances changed, depending on 
the topological characteristics of the net- 
work under consideration. Taken together, 
these results indicated that there is, at 
present, no single best approach to this 
problem. 


4.4 
Network Motifs and Related Concepts 


In Sect. 4.1, it was observed that the 
vertices in a protein—protein interaction 
network tend to have a characteristic vertex 
degree distribution, manifested by a small 
number of high-degree nodes and a large 
number of nodes with a very small degree. 
As a natural extension of node degree, the
distribution of small subnetworks, such as
triangles, squares, and so on, can be
considered. Along this line, Milo et al. defined network
motifs as subgraphs that occurred in 
a network much more often than would be 
expected by chance [165]. The “by chance” 
occurrence in a given type of network is 
usually estimated by constructing a set 
of random networks with the same ba- 
sic properties as the tested network; for 
example, the same degree distribution. 
Subsequently, in a landmark report Milo 
et al. showed that various networks 
were characterized by an overrepresenta- 
tion of certain network motifs. By focusing 
on directed networks, Milo et al. showed — 
among other findings — an overrepresen- 
tation within the gene regulatory network 
of feed-forward loops and bi-fan network 
motifs (Fig. 6a). Network motifs have been 
shown to support specific regulatory functions
[166-170]. In the context of undirected
protein-protein interaction networks,
Przulj et al. studied the distribution 
of small induced subgraphs (graphlets) 
[171]. As a somewhat related concept, 
but taking advantage of protein functional 
annotations, Banks et al. introduced vari- 
ous network schemas to describe patterns 
of labeled subgraphs [172]. A network 
schema consisted of descriptions of pro- 
teins (e.g., their molecular functions or 
putative domains), along with the desired 
topology and types of interaction (e.g., 
physical, phosphorylation, or regulatory) 
(Fig. 6b). In addition to searching for 
matches to particular network schemas, 
it is also possible to infer which network 
schemas are frequent and which are over- 
represented in networks [173], and thereby 
to uncover any general recurring patterns 
that might underlie a range of biological 
processes. 
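
Counting occurrences of a particular motif is the first step of any motif analysis. The sketch below counts feed-forward loops in a directed network given as an edge list; the function name and toy regulatory edges are hypothetical. Deciding whether a count is overrepresented would additionally require comparison with randomized networks, which is not shown.

```python
def count_feed_forward_loops(edges):
    """Count triples (x, y, z) with directed edges x->y, y->z and x->z."""
    targets = {}
    for source, target in edges:
        targets.setdefault(source, set()).add(target)
    count = 0
    for x, x_targets in targets.items():
        for y in x_targets:
            if y == x:
                continue
            for z in targets.get(y, ()):
                # The loop is closed by the direct edge x->z.
                if z != x and z != y and z in x_targets:
                    count += 1
    return count

# Hypothetical regulatory edges: X regulates Z directly and via Y and W.
regulation = [("X", "Y"), ("Y", "Z"), ("X", "Z"), ("X", "W"), ("W", "Z")]
ffl_count = count_feed_forward_loops(regulation)

# A directed 3-cycle contains no feed-forward loop.
cycle_count = count_feed_forward_loops([("a", "b"), ("b", "c"), ("c", "a")])
```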


5 
Comparing Protein-Protein Interaction 
Networks 


The interaction prediction methods dis- 
cussed in Sect. 3.3 were based on the 
assumption that, if two proteins interact in 
one organism, then the orthologs of these 
two proteins in another organism would 
also be very likely to interact. With this 
in mind, Walhout et al. coined the term 
interologs to describe such orthologous
pairs of interacting proteins [174]. Indeed,
it has been shown that a protein interaction
map generated in one species can be used
to predict interactions in another species
[174-176].

Fig. 6 (a) Examples of network motifs in regulatory networks: feed-forward and bi-fan motifs. The arrows
indicate the direction of regulation; (b) Example of a network schema associated with signaling. The nodes are
labeled with specific feature descriptions, such as GTPase and protein kinase, while the edges are labeled with
the interaction type.
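
The transfer of interactions between species through interologs amounts to mapping both partners of each known interaction onto their orthologs. The following sketch assumes that an ortholog mapping is already available; the function name and all protein identifiers are hypothetical.

```python
def predict_interologs(source_interactions, orthologs):
    """Transfer interactions from a source species to a target species.

    source_interactions: iterable of (protein_a, protein_b) in the source species
    orthologs: dict mapping source protein -> set of target-species orthologs
    """
    predicted = set()
    for a, b in source_interactions:
        # An interaction is transferred only if both partners have orthologs.
        for a_target in orthologs.get(a, ()):
            for b_target in orthologs.get(b, ()):
                predicted.add(frozenset((a_target, b_target)))
    return predicted

# Hypothetical identifiers: "y" proteins in the source, "h" proteins in the target.
source_interactions = [("yA", "yB"), ("yB", "yC")]
orthologs = {"yA": {"hA"}, "yB": {"hB1", "hB2"}}

predicted = predict_interologs(source_interactions, orthologs)
```

The interaction yB-yC is not transferred because yC has no ortholog in the mapping, while yA-yB yields one prediction for each ortholog of yB.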

The idea of transferring interaction 
annotation between organisms can be 
extended further by comparing whole 
interaction networks between different 
organisms. Such network comparison 
allows a number of fundamental biological 
questions related to the evolution of 
protein interaction networks to be ad- 
dressed, and predictions to be made 
regarding new functional information 
relating to proteins and interactions that 
are at present poorly characterized [177]. 
Just as sequence alignment represents 
the cornerstone of sequence comparison, 
network comparison demands methods 
for the alignment of biological networks. 
Indeed, the alignment of interaction 
networks from different organisms al- 
lows the discovery of evolutionarily con- 
served pathways and functional orthologs 
[178-182]. 

From an algorithmic-theoretical per- 
spective, network alignment is a difficult 
problem which, in its most general formulation,
reduces to finding a
maximal common subnetwork of two (or 
more) networks. This classical problem in 
graph theory is known to be NP-complete 
and thus difficult to solve computation- 
ally [183], and it is unlikely therefore that 
a fast algorithm to solve the problem in 
the full generality actually exists. Over the 
years, however, a number of insightful and 
efficient algorithms have been proposed 
that have taken advantage of the various 
specific properties of biomolecular net- 
works [180, 184-191]. These algorithms 
allow for global network alignment and
local alignment, as well as for identifying
alignments of a subnetwork within a larger
network. 

With these tools at hand, it is possible 
to search for conserved network regions 
such as conserved protein complexes and 
pathways, to identify proteins that, despite a
lack of sequence similarity, perform
the same function in the network, or 
even to use network similarities to infer 
evolutionary trees [192, 193]. 


6 
Databases of Protein and Domain 
Interactions 


During recent years, as the amount of 
data relating to protein and domain in- 
teractions has steadily increased, various 
databases and public repositories have 
been constructed to enable a sharing of the 
knowledge, and the support of subsequent 
studies. Recent reviews of these databases 
can be found in Refs [194-196], and some 
representative examples of such databases 
are provided in this section. The Database
of Interacting Proteins (DIP) catalogs experimentally
determined protein-protein
interactions which are obtained from var- 
ious resources, including the literature, 
the Protein Data Bank (PDB), and high-throughput
experiments [197, 198]. Both
IntAct [199] and BioGRID [200] contain
details not only of protein-protein interactions
but also of other types of interaction,
such as protein-small molecule interactions
and genetic interactions. MINT [201] 
annotates each interaction with a score 
which ranges from 0 to 1 for quanti- 
fying the interaction support. As none 
of the current databases can provide the 
complete information regarding interac- 
tions for all species, some research groups 
have attempted to extract and unify interaction
data from different repositories;
APID [202] and PINA [203] are two representatives
of such meta-databases. Furthermore,
some databases (e.g., STRING
[54, 204, 205] and I2D [206]) include 
protein-protein interactions predicted by 
computational approaches. Finally, some 
dedicated protein interaction data focusing 
on specific model organisms have been 
incorporated as a part of the organism- 
related resources such as FlyBase [207] 
for D. melanogaster, and the Saccharomyces 
Genome Database (SGD) [208] for yeast. 

The analysis of domain interactions of- 
ten provides important insights into both 
the role and the mechanism of an interaction.
The databases 3did [209], iPFAM
[210], and PIBASE [211] allow for explor- 
ing the details of domain interactions by 
studying the three-dimensional structures 
of interacting domains extracted from 
the PDB [212]. The Conserved Binding 
Mode (CBM) database [213] categorizes 
interacting domains by the Conserved Do- 
main Database (CDD) family type and 
interaction mode. DOMINE contains both 
known and predicted domain interactions 
obtained from two three-dimensional, 
structure-based databases (iPFAM and 
3did), and eight different computational 
approaches [214]. 


7 
Applications 


A knowledge of protein interactions can 
provide important clues concerning the 
functioning of cells and organisms. In this 
section, several examples are provided of 
how interaction networks can be explored 
to empower biomedical research. Clearly, 
the examples listed below represent only 
a small sample of the diverse applications 
of biomolecular networks. 


7.1 
Predicting Protein Function 


The observation that the majority of interactions
occur between proteins with a 
common functional assignment [215] has 
paved the way towards several approaches 
for predicting protein function, based on 
the location of the protein in the network in 
relation to functionally annotated proteins. 
Sharan et al. have subdivided the emerging 
methods into two types, namely direct and 
module-assisted schemes [216]. In the di- 
rect annotation schemes, individual links 
in the network are used for inferring the 
functions of proteins, whereas module- 
based methods first detect modules of 
interconnected proteins and then assign 
protein functions based on the functional 
annotation of other proteins in the mod- 
ule. Thus, the key step of module-based 
methods is to use a module-finding ap- 
proach (as discussed in Sect. 4.3). 

At this point, attention will be focused 
on the direct annotation approaches. 
The pioneering method developed by 
Schwikowski et al. predicted the biologi- 
cal process of a non-annotated protein by 
considering its neighboring interactions, 
and assigning to this protein the anno- 
tations that were most frequent among 
the neighbors [215]. This strategy worked 
very well for biological networks with a 
high proportion of annotated proteins, 
where un-annotated proteins had many 
annotated neighboring proteins [217]. Sub- 
sequently, Hishigaki et al. extended this 
neighbor-derived annotation method by 
considering all of the proteins within a par- 
ticular radius, rather than just the direct 
neighbors of given proteins [218]. Later, 
Nabieva et al. argued that during the trans- 
fer of functional annotation from more 
distant neighbors, account should be taken 
not only of the distance to the annotated
proteins, but also of the topology of the
network. This view was utilized in the
“functional flow” algorithm described by
these authors [217]. 
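
The simplest of the direct annotation schemes, the neighbor-counting approach, can be sketched as a majority vote over the annotated neighbors. The function name, the toy network, and the annotation labels below are hypothetical, and the sketch considers direct neighbors only.

```python
from collections import Counter

def predict_function(protein, adj, annotations, top=1):
    """Assign the annotations that are most frequent among direct neighbors."""
    votes = Counter()
    for neighbor in adj.get(protein, ()):
        # Un-annotated neighbors contribute no votes.
        votes.update(annotations.get(neighbor, ()))
    return [label for label, _ in votes.most_common(top)]

# Hypothetical toy network around an un-annotated protein "P0".
adj = {"P0": {"P1", "P2", "P3", "P4"}}
annotations = {
    "P1": {"transport"},
    "P2": {"transport", "signaling"},
    "P3": {"transport"},
    # "P4" is itself un-annotated.
}

prediction = predict_function("P0", adj, annotations)
```

As the text notes, a vote of this kind is informative only when the protein has several annotated neighbors; the extensions described above draw on more distant proteins and on the network topology.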

The integration of network topology with 
other types of data, such as gene expression
profiles, domain context, and text 
mining may further inform functional an- 
notation [216, 219, 220]. Kourmpetis et al. 
proposed the use of a Markov random field 
analysis for integrating protein interaction 
networks with multiple data sources [221]. 


7.2 
Application to Human Diseases 


Today, it is increasingly recognized that 
complex diseases should be studied from 
the perspective of dysregulated path- 
ways and processes, rather than of in- 
dividual genes. Indeed, during recent 
years the availability of genome-scale 
protein-protein interaction and other in- 
teraction maps has made it possible to 
begin such systems-level investigations of 
human diseases. By following this princi- 
ple, Chuang et al. proposed a network- 
based method for the classification of 
breast cancer metastasis [222], the main 
idea being to combine gene expression 
profiles with network/pathway informa- 
tion and thus to seek disease-altered 
subnetworks. The underlying assumption 
here is that disease-related perturbations, 
manifested by gene expression changes, 
would propagate over the interaction net- 
work and lead to clusters of perturbed 
nodes. Importantly, whilst some abnor- 
mally expressed genes might differ be- 
tween disease cases, many clusters would 
be expected to remain common. Such 
a perspective has proven to be helpful 
for disease classification [223, 224], in 
the identification of disease dysregulated 
pathways [225, 226], and for identifying
disease-associated genes [227-233]. 
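
The assumption that perturbations lead to clusters of perturbed nodes can be illustrated with a small Python sketch. It is a toy example rather than the method of Chuang et al.: the z-score threshold, the gene names, and the function `perturbed_clusters` are hypothetical, and a "cluster" is taken here to mean simply a connected group of perturbed genes in the interaction network.

```python
def perturbed_clusters(graph, zscores, threshold=2.0):
    """Group perturbed genes (|z| >= threshold) into connected clusters.

    graph: dict mapping a gene to a list of interaction partners.
    zscores: dict mapping a gene to an expression-change z-score.
    """
    perturbed = {gene for gene, z in zscores.items() if abs(z) >= threshold}
    clusters = []
    seen = set()
    for start in sorted(perturbed):
        if start in seen:
            continue
        # Depth-first search restricted to perturbed genes.
        stack = [start]
        seen.add(start)
        members = set()
        while stack:
            gene = stack.pop()
            members.add(gene)
            for partner in graph.get(gene, ()):
                if partner in perturbed and partner not in seen:
                    seen.add(partner)
                    stack.append(partner)
        clusters.append(sorted(members))
    return clusters


# Hypothetical linear network A-B-C-D-E; C is not perturbed.
graph = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}
zscores = {"A": 2.5, "B": -3.1, "C": 0.2, "D": 2.2, "E": 2.9}

print(perturbed_clusters(graph, zscores))  # [['A', 'B'], ['D', 'E']]
```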

The results of recent studies have also 
begun to connect disease-perturbed net- 
works with genetic variations. By integrat- 
ing expression and genotypic data from 
an intercross population, Cheng et al. 
were able to identify a liver and adi- 
pose macrophage-enriched subnetwork 
that was associated with metabolic disease 
traits [234]. More recently, Kim et al. proposed
a method to trace the propagation of the effects
of copy number variations through interaction
networks and to identify the biological pathways
affected by such perturbations. This method
employs gene expression profiles, copy number
variation information, and diverse interaction
networks [124, 126, 127].


8 
Looking Ahead: Towards the Dynamic 
Interactome 


The availability of a genome-scale inter- 
actome has provided the possibility of 
asking general questions with regards to 
the organization of biological systems, and 
of informational flow within such sys- 
tems. Whilst in this chapter attention has 
been focused on protein-protein interac- 
tion networks, for practical applications — 
including some of those described here — 
integrative networks combining the vari- 
ous types of interaction are increasingly 
being used. Today, a growing number of 
tools are available that allow the inter- 
rogation of existing networks to generate 
testable hypotheses, despite the majority of 
current studies still treating molecular networks
as “hardwired” structures. Clearly, real biological
networks are characteristically dynamic, a property
on which their functioning is heavily dependent.
Recently, Przytycka et al. categorized interactome
dynamics as being spatial, temporal,
and contextual [235]. Hopefully, the second 
decade of interactome studies will incorpo- 
rate a shift from static to dynamic network 
analyses, thereby providing a deeper un- 
derstanding of molecular systems. 


Keywords


Microbiome 

The totality of microbes and the collective genomes of all the organisms and 
environmental interactions, within a defined environment. A defined environment 
might be the gut of a human being, or an animal. Thus, the microbiome usually 
includes microbiota and their complete genetic elements. 


Microbiota 
The collection of microbes that colonize a particular body site or a host. 


Metagenomics 

Metagenomics is a culture-independent study of microbial communities contained in 
an environmental sample. Metagenomics describes the functional and sequence-based 
analysis of genetic material from a mixed community of organisms contained in an 
environmental sample. 


Personalized medicine 
The customization of an individual patient’s preventive and therapeutic care, based on 
genetic or other information. 


Metagenomics, also referred to as environmental or community genomics, has brought 
about radical changes in the ability to analyze complex microbial communities by 
direct sampling of their natural habitat. Metagenomics has truly revolutionized 
biology and medicine, and changed the way in which genomics is studied. To 
date, many metagenomic studies have been undertaken, with samples from diverse 
habitats including the oceans, soil, air, human, and animal hosts having been subject 
to metagenomic examinations. Currently, huge national and international projects, 
aimed at elucidating the biogeography of microbial communities living within and 
on the human body, are well underway. The analysis of human microbiome data 
has brought about a paradigm shift in the present understanding of the role of 
resident microorganisms in human health and disease, and brings nontraditional 
areas such as gut ecology to the forefront of personalized medicine. In parallel, rapid 
technological advances in DNA sequencing methods have reduced the time and 
costs associated with sequencing while at the same time significantly increasing the 
data output. As genome sequencing becomes cheaper, it is being applied to sequence
complex metagenomes, and large-scale 16S ribosomal DNA sequencing has become 
far more routine. Today, metagenomics is proving to be a powerful tool, considerably 
enhancing the present understanding of the extent and role of microbial diversity 
in their natural habitats, and in many ecologically important environments, with 
far greater implications for human health and disease. An overview of the current
literature, together with details of projects and the state of the art in microbiome
studies, is presented in this chapter.


1 
Introduction 


Traditionally, genome-sequencing pro- 
jects have relied on the availability of cul- 
tivated isolates to determine the complete 
genetic complement of an organism by 
sequencing the base pairs of its DNA. 
Inherently, however, this approach is sig- 
nificantly limited in its inability to recover 
genetic information from uncultivated 
microorganisms. Metagenomics is the 
study of microorganisms recovered from 
an environment in a culture-independent 
fashion, typically with the goal of simply 
measuring the population structure, 
genetic diversity, and ecological relation- 
ships of the microbes in a sample, but 
increasingly with the broader goal of 
building a systems biological model of the 
community. The study of single microbial 
species at the genomic level, and the con- 
current development of tools to sequence, 
assemble, and analyze these species, have 
accelerated the field of metagenomics. 
By using the power of genomic analysis, 
the emerging field of metagenomics 
provides novel approaches for analyzing 
genetic information, to reveal important 
characteristics of microbial communities 
within a sample. The assemblage of 
microorganisms that reside within or 
on human and animal hosts has been 
termed the indigenous microbiota. The 
term microbiome, as originally introduced 
by Joshua Lederberg, essentially refers to 
the community of microbes, their genetic 
elements (genomes), and environmental 
interactions in an “ecological niche” such 
as the human gut or a soil sample. 

The true capabilities of new sequenc- 
ing technologies and progress in the field 
were exemplified by the first large-scale 
environmental metagenomic study, con- 
ducted by Craig Venter and his colleagues 
at the J. Craig Venter Institute (JCVI),
and reported in 2004 [1]. The results 
included details of the sequencing and 
analysis of samples collected from the 
Sargasso Sea by using whole-genome shot- 
gun (WGS) sequencing, coupled with 
high-performance computing developed 
to sequence the human genome. Through 
this study, the team reported the discov- 
ery of more than 1.2 million new genes. 
This initial study evolved to become JCVI’s 
Global Ocean Sampling [2] initiative (as de- 
scribed in more detail below) which to date 
has generated millions of sequences de- 
rived from sites in the oceans, worldwide. 
To date, the Global Ocean Sampling (GOS) 
expedition to study marine microbial di- 
versity from a complex environment re- 
mains the largest protein dataset [1, 2], and 
has also resulted in a number of spin-off 
studies that have been focused on the 
datasets that were initially generated [3-5]. 

Other large-scale metagenomic projects, 
such as the National Institutes of Health 
(NIH)-funded Human Microbiome
Project (HMP), which was initiated to 
study human-associated microbial com- 
munities, have continued to evolve along 
with smaller single-investigator-driven 
projects focused on animals that include 
ruminants [6], canines [7], poultry [8], 
and nonhuman primates [9], as well as 
various insect species [10, 11]. The main 
results of these investigations have been 
the generation of a tremendous amount
of data, together with an acceleration of 
computational tool development aimed 
at interpreting genome sequence data, 
and at interpreting species diversity, 
as well as the provision of laboratory 
tools to identify, culture, and sequence 
novel species. Today, studies such as 
the HMP continue to make inroads 
into understanding the larger picture of 
human health, in wellness and disease, in 
addition to the environmental impact of
microbial inhabitants on disease, immune 
status, and human physiology. 


2 
History of Microbial Diversity Studies 


In the past, 16S rDNA gene analysis has 
been used widely to identify microbial 
species and to interrogate the diversity 
of microbes in a range of environments. 
The value in using the 16S rDNA gene 
as a phylogenetic marker derives from 
its being a necessary component of the 
ribosomal apparatus of microbial cells, 
and hence being present in all prokaryotic 
genomes. It also has a relatively slow rate 
of evolution coupled with highly conserved 
motifs, which are ideal for primer design. 
Several faster-evolving (variable) segments 
within the 16S rDNA gene sequence 
provide useful phylogenetic information, 
and are compatible with some of the 
next-generation sequencing technologies. 
Lastly, there is little evidence that the 16S 
gene moves by lateral gene transfer; thus, 
it tracks closely with the protein-coding 
genes that define a particular microbial 
lineage. 

The 16S rDNA gene can, however, be 
less than perfect, as its copy number varies 
from species to species and even between 
strains, with copy numbers ranging any- 
where from one to a dozen copies per cell. 
As such, 16S rDNA gene sequencing is 
an unreliable indicator of the actual abun- 
dance of the species being investigated. 
Intraspecies differences in ribosomal oper- 
ons have also been described [12-15], 
again leading to the possible misinterpre- 
tation of results obtained from surveys. 
Furthermore, while 16S has largely been 
successful at classifying organisms down 
to the species level, it evolves too slowly 


to capture the diversity within species. 
Finally, the primer design and amplifi- 
cation processes that are used to obtain 
the 16S rDNA gene sequences are also 
known to introduce biases and errors in 
the process, including the formation of 
chimeric products, biased rates of ampli- 
fication, and a failure to amplify some 
targets due to primer mismatch. 
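
The effect of copy number variation on abundance estimates can be made concrete with a short worked example in Python. The taxon names, read counts, and copy numbers below are hypothetical, and the correction shown (dividing reads by copy number before normalizing) is a minimal sketch that assumes the copy numbers are known and that amplification is otherwise unbiased, which the limitations listed above suggest is often not the case.

```python
def copy_number_corrected_abundance(read_counts, copy_numbers):
    """Estimate relative cell abundance from 16S read counts.

    Each taxon's reads are divided by its 16S copy number per cell,
    then the results are normalized so that they sum to 1.
    """
    cells = {taxon: read_counts[taxon] / copy_numbers[taxon] for taxon in read_counts}
    total = sum(cells.values())
    return {taxon: value / total for taxon, value in cells.items()}


# Hypothetical survey: taxon_A carries 7 copies per cell, taxon_B only 1.
read_counts = {"taxon_A": 700, "taxon_B": 300}
copy_numbers = {"taxon_A": 7, "taxon_B": 1}

naive = {taxon: n / sum(read_counts.values()) for taxon, n in read_counts.items()}
corrected = copy_number_corrected_abundance(read_counts, copy_numbers)

print(naive)      # {'taxon_A': 0.7, 'taxon_B': 0.3}
print(corrected)  # {'taxon_A': 0.25, 'taxon_B': 0.75}
```

In this example the taxon that dominates the raw reads turns out to be the minority of cells once copy number is taken into account.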

Regardless of these limitations, 16S 
rDNA sequencing continues to produce 
a wealth of survey data. The latest update 
from the Ribosomal Database Project [16], 
(http://rdp.cme.msu.edu/; RDP Release 
10, Update 26 :: 28 March 2011) includes a 
total of 1,613,063 16S rRNAs from species 
that inhabit many different environments. 
Other public curated collections of 16S 
rRNA sequences that are available as on- 
line resources include GreenGenes [17], 
SILVA [18], and EZ-Taxon [19]. In the 
absence of better avenues for detailed 
comparative purposes, 16S gene-based 
approaches for characterizing microbial 
diversity have become popular tools for 
microbial ecology. Moving past a heavy 
reliance on Sanger-based full-length 16S 
rDNA gene sequencing, where the approximately
1500 base-pair sequence was the
goal, newer sequencing technologies have 
made it possible to generate hundreds of 
thousands of 16S genes from a single en- 
vironment, allowing the community to be 
studied in unprecedented detail. 

The newer generation of sequencing
technologies generates shorter
read lengths, and as such has been used
primarily in large surveys and for the 
inference of microbial diversity in eco- 
logical samples [20]. Most of the studies 
that employ the use of next-generation 
sequencing technologies tend to focus 
on one or more variable regions for 
inference of the population diversity [21]. 
A multiple-primer sequencing strategy 


on the Illumina sequencing platform to 
obtain longer 16S sequences needed for 
taxonomic analysis of mixed populations 
of bacteria was recently described [22]. 
The large genomic sequence data sets in 
multiple data formats present novel com- 
putational challenges for the exploration 
of metagenomic sequence data and clas- 
sification of 16S rRNA sequences. Many 
new bioinformatic tools geared towards 
high-throughput analyses have been de- 
ployed to assess microbial phylogeny and 
the diversity of complex flora [23, 24]. The 
next-generation sequencing technologies 
offer the ability to sample genomes of or- 
ganisms that occur at low abundances in 
communities, at a much lower cost. This 
has resulted in a high discovery rate of pre- 
viously uncultured and novel organisms. 


3.1
Oceans 


The oceans are teeming with microscopic 
organisms that have been characterized 
only to a limited extent, but whose actions 
are crucial to maintaining the biosphere. 
At the same time, the all-encompassing 
term “ocean” belies the complex and 
varied habitats that make up marine 
ecosystems, serving to downplay the diverse
communities that exist within them.
The GOS expedition, which started in 
2003 with a pilot project in the Sar- 
gasso Sea, is one of the first and largest 
environmental genomic studies. The ap- 
proach was to collect hundreds of liters 
of surface seawater and to filter this into 
different size fractions dominated by eukaryotes
(3.0 µm), eukaryotes and bacteria (0.8 µm),
bacteria (0.1 µm), and viruses.
Initially, the sequencing costs were the
limiting factor, and consequently most
attention was placed on the 0.1 µm bacterial
fraction, where the gene density was
expected to be greatest and the potential 
for confounding factors such as introns 
or repetitive elements would be mini- 
mized. Marine microbial communities are 
thought to be fairly complex, and to con- 
tain thousands of distinct taxa. Given that 
so little was known about these microbial 
communities, however, the goal was to 
circumnavigate the globe and to examine 
the genomic content of these communi- 
ties, explore the diversity present in these 
systems, and attempt to gain a better 
understanding of how microbes, commu- 
nities, and the environment interact. 

The initial pilot project, conducted in the 
Sargasso Sea, generated over 1.7 million 
Sanger sequencing reads, from which 
over 1.2 million peptide sequences were 
identified. Many of these peptides pro- 
vided a tremendous expansion in known 
protein families, as exemplified by the 
proteorhodopsin gene family. The prote- 
orhodopsins function as a fast, light-driven 
proton pump, and are thought to pro- 
vide an important (if limited) advantage to 
heterotrophic organisms in nutrient-poor 
environments, such as those found in the 
open ocean. Previous studies, including 
polymerase chain reaction (PCR)-based 
surveys had characterized a total of 67 dif- 
ferent proteorhodopsins from a handful of 
different clades. In the Sargasso Sea pi- 
lot study, 782 additional proteorhodopsins 
were identified that belonged to a variety 
of unsampled clades, further confirming 
the widespread distribution of this gene 
family in marine ecosystems. The expan- 
sion of the proteorhodopsins was typical 
of many characterized protein families. 
Approximately 65% of the new peptides 
matched previously characterized orphan 
hypothetical genes, thus identifying entirely
new protein families. The growth in existing
protein families, and the identification of new
families, was both exciting and troubling, however:
the major impediment when handling these large
volumes of data was a lack of data-mining
tools to detect metagenomically isolated genes
and to develop better models for describing the
function and role of gene families. Nonetheless, the Sargasso
Sea pilot study ultimately demonstrated 
the power of metagenomics for identify- 
ing novel genes and linking them to their 
environment. 

The GOS expedition began officially 
in 2004, starting with a transect down 
the eastern seaboard, through the Gulf 
Stream and the Gulf of Mexico, across the 
Panama Canal, and ending just outside of 
the Galapagos Islands. Phase I included 
41 different locations covering a number 
of different biomes and spanning a wide 
range of environmental conditions. In a 
series of reports, the Phase I dataset was 
analyzed either from a community com- 
position perspective in which tools such 
as assembly, fragment recruitment, and 
comparative metagenomics were applied 
[25], or from a protein-content perspec- 
tive, where techniques for clustering the 
peptide data were developed to explore the 
growth of protein families and the diversity 
within protein families [26]. The overarch- 
ing result of these studies was that the 
diversity within the marine environment 
was tremendous. In fact, diversity within 
the taxonomic classification of species was 
so extensive that no two cells would be 
expected to be the same in terms of 
gene content. Within protein families, the 
breadth of the analysis led to the discov- 
ery of protein families in the Eubacteria 
that previously had only been associated 
with eukaryotes or the Archaea. Phase I 


also included studies of the viral fractions 
examining the co-occurrence of phage and 
host, along with the types of genes associ- 
ated with viral populations. 

Currently, the GOS expedition contin- 
ues to collect and analyze samples. As 
of 2011, the GOS includes hundreds of 
samples from the circumnavigation, along 
with additional samples from the west 
coast of North America, from the Antarc- 
tic Ocean and from Europe, including the 
Baltic and Mediterranean Seas. Recently, 
the Moore Foundation sponsored the analysis
of a diverse set of 137 cultivated
marine microbes isolated from across the 
globe. Many of these genomes were not 
well represented in the GOS, either be- 
cause they are found in different size 
fractions or because they were not abun- 
dant in marine surface waters. Those that 
were abundant constitute the “planktonic” 
fraction of the microbial community, and 
show adaptations to living in low-nutrient 
environments. Improvements in metage- 
nomic assembly and binning are helping 
to identify genomes for many of the un- 
characterized and uncultivated clades in 
the ocean [27]. 


3.2 
Air 


The atmosphere is a complex and very dy- 
namic environment that, to date, has been 
poorly characterized in terms of its biolog- 
ical content. In a typical day, an average 
adult breathes in over 10,000 liters of air 
while simultaneously exposing themselves 
to the array of organisms that use the air 
as a mechanism for their dispersal. Air 
plays host to a diverse array of organisms, 
including pollen granules, resilient fungal 
and microbial spores, aerosolized bacteria, 
and viruses, which are either free-floating 
or attached to small dust particles. That 


these microbial entities have an impact on 
human health is well accepted, yet much 
concerning the diversity, distribution, and 
activity of these airborne organisms is un- 
clear. For example, do indoor and outdoor 
environments differ in their biological 
composition? Have microbes adapted to 
live in climate-controlled environments? 
Do different indoor environments host dif- 
ferent airborne communities? Does this 
result in a greater exposure of humans 
to organisms possessing virulence factors 
and pathogenicity islands? In the longer 
term, it would be beneficial to know how 
the bio-aerial community can be effectively 
monitored, and to determine effective 
methods of intervention that would both 
minimize exposure to potentially harm- 
ful organisms and increase exposure to 
protective organisms. 

The challenge when studying the aerial 
microbial community is that the atmo- 
spheric load is approximately 10 bacteria 
per liter, while that of seawater may be in 
excess of 10°. At issue here is the need for 
sufficient genetic material to perform se- 
quencing and other molecular techniques, 
ideally without having to resort to amplifying
the DNA, which can (and will)
introduce biases. Furthermore, such small
quantities of biological material make it 
easy to introduce contaminants that can 
swamp any real signal. Air samples also 
contain a variety of nonbiological materi- 
als, including small dirt, metal, and dust 
particles that must be removed but which 
initially are either attached to or mixed with 
the microorganisms. Finally, many organ- 
isms are contained in pollen granules and 
spores that are extremely resistant to lysis. 
Given all of these issues, it is especially im- 
portant for research groups to design their 
experiments and controls very carefully, in 
order to ensure that the data acquired by 
these steps are both reliable and unbiased. 

Given all of these challenges, it is not
surprising that only a small number of 
indoor or outdoor air studies have been 
conducted. In the largest air metagenomic 
study carried out to date [28], samples of 
indoor air from two shopping centers in 
Singapore were compared with nearby soil 
and water samples. These data showed both
air samples to be less diverse, and distinct 
from the other environmental samples. 


3.3 
Host-Associated Microbiomes 


Host-microbiome studies have provided
insights into the interactions between 
species and their mutualistic inhabitants. 
In theory, they should also allow a distinc- 
tion to be made between microbes that 
are triggers for disease conditions, neutral 
organisms that are “along for the ride,” 
and microbes that are beneficial. Many of 
these studies have focused on the micro- 
bial diversity associated with the intestinal 
community. For example, a little-known 
fact is that some of the earliest metage- 
nomic studies were actually focused on 
the microbial diversity of the rumen [6]. 
In this case, Bryan White and colleagues 
at the University of Illinois, along with 
colleagues at the JCVI in Rockville, MD, 
USA, were funded by the US Department 
of Agriculture (USDA) to study the im- 
pact of diet on changes in the microbial 
populations of the rumen in cattle [6]. 
More recently, Kim et al. [29] tallied 
the microbial diversity estimates from 
all available rumen studies, based on a 
meta-analysis of all available curated 16S 
rRNA gene sequences deposited in the 
RDP database. As of November 2010, 
totals of 13,478 bacterial and 3,516 ar- 
chaeal rRNA sequences had been de- 
scribed, that could be assigned to 5,271 
species-level operational taxonomic units 
(OTUs) that represented 19 existing phyla.
The Firmicutes, Bacteroidetes, and Pro- 
teobacteria were found to be the most 
predominant phyla. In total, 180 existing 
genera were represented, with nearly all of 
the archaeal sequences being assigned to 
the phylum Euryarchaeota and represent- 
ing 12 existing archaeal genera. The survey 
results suggested that there is a significant 
amount of microbial diversity yet to be de- 
scribed at the metagenomic level, and that 
this will hopefully be achieved with the 
new advances in sequencing technology. 
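
To make the notion of an operational taxonomic unit more concrete, the following Python sketch groups sequences into OTUs by greedy clustering against seed sequences. It is a toy illustration only: the 97% identity threshold is an assumption (a commonly used convention for species-level OTUs, not a value stated in the text), the sequences are synthetic and pre-aligned to equal length, and the function names are hypothetical. It does not reproduce the procedure used in the meta-analysis described above.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(sequences, threshold=0.97):
    """Assign each sequence to the first OTU whose seed it matches at >= threshold.

    Returns a list of (seed, members) pairs; a sequence that matches no
    existing seed becomes the seed of a new OTU.
    """
    otus = []
    for seq in sequences:
        for seed, members in otus:
            if identity(seq, seed) >= threshold:
                members.append(seq)
                break
        else:
            otus.append((seq, [seq]))
    return otus


# Synthetic 40-base sequences (hypothetical, for illustration only).
seed = "ACGT" * 10
one_mismatch = "T" + seed[1:]         # 39/40 = 97.5% identical to the seed
three_mismatches = "TTTT" + seed[4:]  # 37/40 = 92.5% identical to the seed

otus = greedy_otus([seed, one_mismatch, three_mismatches])
print(len(otus))                               # 2
print([len(members) for _, members in otus])   # [2, 1]
```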

Recent studies [7] to characterize the 
phylogeny and functional diversity of the 
canine gastrointestinal microbiome were 
the first of their type. In this case, fecal 
material from healthy adult dogs fed a 
low-fiber control diet (K9C), or a diet con- 
taining 7.5% beet pulp (K9BP), was used 
for 454 pyrosequencing. The dominant 
phyla included the Bacteroidetes/Chlorobi 
group and Firmicutes (which comprised 
about 35% of all sequences), followed by 
Proteobacteria (13-15%) and Fusobacteria 
(7-8%). The archaeal species identified in 
this study were not influenced by diet, and 
represented approximately 1% of all se- 
quences. Less than 0.4% of the sequences 
were of viral origin, and more than 99% of 
these were associated with bacteriophages. 
Interestingly, the clustering of several gas- 
trointestinal metagenomes demonstrated 
similarity between dogs, humans, and 
mice. 

White and colleagues at the JCVI have 
also recently reported their findings on 
nonhuman primates. A comparative anal- 
ysis of the microbial populations derived 
from the gastrointestinal microbiomes of 
nonhuman primate species may possi- 
bly provide a better understanding of the 
evolution of nonhuman primate diver- 
sity. Based on an analysis of fecal sam- 
ples from three different wild nonhuman 


primate species, namely black-and-white 
colobus (Colobus guereza), red colobus
(Piliocolobus tephrosceles), and red-tailed 
guenon (Cercopithecus ascanius), the Fir- 
micutes were shown to comprise the vast 
majority of the phyla, while the microbial 
community composition was more simi- 
lar within the same species than among 
different species. Moreover, when com- 
pared to humans, the microbial diversity 
was distinct and related to the host phy- 
logeny. Diet was also seen to be a possible 
influence on community structure. 


3.4 
Human Body 


In humans, the microflora represents an 
integral part of the genetic landscape and 
evolution. It has been estimated that the 
number of microbial cells resident in the 
human body outnumbers those of the host 
by a factor of at least 10:1, where complex
and diverse species have been identified 
in the different body sites. The results 
of previous microbiome studies have also 
shown that microbial communities vary 
within and between human hosts. For 
example, the analysis of fecal microbial 
communities of obese and lean twins has
shown that, despite a shared “core micro- 
biome” at the gene level within family 
members, phylum-level changes (along 
with a host of other factors) influence 
the overall metabolic profile of an individ- 
ual [30]. Another recently conducted gut 
microbiome study highlighted the influ- 
ence of diet in shaping the structure and 
functional capabilities of the gut micro- 
biota [31]. Evidence from this and other 
studies has suggested that bacterial lineages
are unique to the host, and a “core
microbiome” is largely defined through
ecological interactions within hosts, as 
well as host variability in terms of dietary 


nutrients, genotype, illness, antibiotic use, 
and colonization history [32]. 

Currently, several ongoing microbiome studies
are designed to examine
any differences between the microbiomes
of healthy patients and those of pa- 
tients suffering from disease, under 
the umbrella of the International Hu- 
man Microbiome Consortium (IHMC) 
(www.human-microbiome.org) and the 
Human Microbiome Project (HMP). The 
goal of the IHMC is to create a com- 
mon set of guiding principles and policies, 
and to generate shared data resources that 
would enable research groups to conduct 
comprehensive studies on the human mi- 
crobiome. Currently, several groups world- 
wide are participating in human micro- 
biome research, conducting studies on var- 
ious aspects of the microbiome in healthy 
adults and its effects on disease states. 
Members of the IHMC include agencies 
from Canada, the United States, the European
Commission, Japan, and Australia. The
NIH initiated the HMP as an initiative
of the NIH Roadmap for Biomedical
Research (http://nihroadmap.nih.gov),
with the goal of comprehensively char- 
acterizing the microbial flora within and 
on the surface of the human body, and 
to understand the role of the microbiome 
in human health and disease. The objec- 
tives, timeline, and complete details of 
implementation of the project have been 
described in the marker report [33].

The MetaHIT consortium (http://www.metahit.eu/),
which is an active participant in the IHMC,
reported the first large-scale metagenomic
study of the gut microbiome of 124 human
subjects of European origin [34]. For this, the authors
analyzed fecal samples using Illumina se- 
quencing technology, and reported the 
characterization of 3.3 million nonre- 
dundant microbial genes, which were 
largely shared among the individuals. The
microbial gene catalog generated through 
this project was approximately 150-fold 
larger than the human gene complement. 
Most recently, the MetaHIT consortium
reported an additional analysis that combined
previously published data sets with 22 newly
sequenced fecal metagenomes of individuals
originating from four countries across
continents. In this case, the data were classified
into three clusters, mostly driven by species
composition, and referred to as distinct
enterotypes, which suggested the existence
of a limited number of well-balanced,
host-microbial symbiotic states [35].
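
Grouping samples by composition requires some measure of how different two community profiles are. The Python sketch below computes the Jensen-Shannon divergence between two relative-abundance profiles as one possible measure. The choice of this measure is an assumption made for illustration (the text does not say how the clusters were obtained), and the sample profiles and genus labels are hypothetical.

```python
import math


def kl_divergence(a, m, taxa):
    """Kullback-Leibler divergence (base 2) of profile `a` from mixture `m`."""
    return sum(
        a.get(t, 0.0) * math.log2(a.get(t, 0.0) / m[t])
        for t in taxa
        if a.get(t, 0.0) > 0
    )


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two relative-abundance profiles.

    p, q: dicts mapping a taxon to its relative abundance (each summing to 1).
    The result is 0 for identical profiles and 1 for profiles with no shared taxa.
    """
    taxa = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in taxa}
    return 0.5 * kl_divergence(p, m, taxa) + 0.5 * kl_divergence(q, m, taxa)


# Hypothetical genus-level profiles for three samples.
sample_1 = {"genus_A": 0.5, "genus_B": 0.5}
sample_2 = {"genus_A": 0.5, "genus_B": 0.5}
sample_3 = {"genus_C": 1.0}

print(jensen_shannon_divergence(sample_1, sample_2))  # 0.0
print(jensen_shannon_divergence(sample_1, sample_3))  # 1.0
```

A matrix of such pairwise values could then be passed to any standard clustering procedure; which procedure is appropriate is a separate design choice not covered by this sketch.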

During recent years, through the HMP 
and other similar projects around the 
globe [36], access to the human mi- 
crobiome and an understanding of its 
influence on health and disease has 
become of utmost importance. Today, 
the HMP has generated metagenomic 
sequence data from 15 to 18 body 
sites of 300 “normal” individuals, many 
of whom have become repeat sample 
donors. In parallel to the metagenomic 
work, culture-based, nonculture-based, 
and single-cell approaches have allowed 
access to the genomes of reference species 
that will provide insight into the genetic 
diversity of these microorganisms, as well 
as acting as a scaffolding for the metage- 
nomic datasets [37]. 

The first large-scale report from the 
HMP consortium described 178 reference 
genomes that were generated as a result 
of much sequencing and annotation effort 
[33]. From these genomes,
547,968 polypeptides that were greater
than 100 amino acids in length were iden- 
tified, of which 29,987 were unique; that 
is, they had not been identified previously 
when the datasets were compared against 
all publicly available data. Although, at 
first sight, this might appear to be a large 
dataset, it is known that the genomes
included in this initial study — despite 
being the largest conglomeration of se- 
quenced genomes in a single publication 
— represent but a subset of the species 
associated with the human body, based 
on the interrogation of available 16S sur- 
veys of the human body. The continued 
data-mining of these and other datasets 
that are being generated by the HMP and 
international consortia (see Ref. [35]), as 
well as by other smaller groups, is an- 
ticipated to reveal additional significant 
details of gene clusters, antibiotic markers, 
plasmids, and phages. 


3.5 
Virome Studies 


Although most of the initial microbiome
research (especially gut microbe studies)
was focused on exploring the eubacterial
community, archaea, viruses, fungi, and 
other microbes have also been frequently 
detected in environmental samples [38]. 
The discovery of RNA viruses in healthy 
human fecal samples [39] led to a general 
acceptance that the normal range of 
microbes in the human gut includes 
a wide variety of viruses, including 
enteric viruses, bacteriophages, and many 
known and unknown human viruses. 
Advances in sequencing technologies 
and the subsequent development of novel 
bioinformatic methods have allowed 
clinical research groups to increasingly 
adapt genomics-oriented approaches for 
translational medicine. Today, metage- 
nomic sequencing has opened new 
avenues for the discovery of novel viral 
species from clinical specimens, and also 
provides opportunities for understanding 
the etiology of unexplained illnesses. 
Deep-sequencing analyses of viruses, 
aided by metagenomics-based tools, have
allowed an in-depth understanding of
host-pathogen interactions to predict the
existence of emerging and re-emerging 
virus-borne infectious diseases [40], and 
to examine the role of common viral infec- 
tions interacting with the host microbiome 
in the context of complex diseases [41]. 


4 
Single-Cell Genomics 


Single-cell genomics is a relatively new, 
but very powerful, technique that can be 
used to examine the DNA and (poten- 
tially) also the RNA content of one cell, 
in simultaneous fashion. It has been esti- 
mated that more than 99% of the microbes 
that exist on Earth currently elude stan- 
dard cultivation. Hence, the major benefit 
of the single-cell technique is that it pro- 
vides an ability to isolate one cell at a 
time from a complex environmental sam- 
ple, to amplify its genomic contents and, 
eventually, to sequence that genome. At 
its core, single-cell genomics promises to 
‘digitize’ microbial ecology by allowing 
research groups to examine entire com- 
munities at their most granular level — 
that of the individual cell. Today, individ- 
ual cells can be isolated relatively easily 
by dilution, or preferably by cell sorting 
using flow cytometry. The latter technique 
is of great potential, since in theory it can 
be used in conjunction with various stains 
for the selective isolation of cells of inter- 
est from a heterogeneous population. In 
this case, the cells of interest might either 
be actively dividing or carrying out spe- 
cific metabolic processes. After isolation, 
the cells are lysed and then undergo mul- 
tiple displacement amplification (MDA) to 
provide a billion-fold amplification of the 
DNA to an abundance that will support 
sequencing; the MDA is accomplished
using the phi29 polymerase, which is both
fast and accurate. After amplification, the
individual cells are tested for the presence 
of the 16S or 18S rRNA gene; this not 
only allows the research team to pick and 
choose which cells to take for further analy- 
sis, but also permits the generation of large 
banks of amplified single cells, and the se- 
lection of targets of interest while avoiding 
potentially redundant sequencing. 

The use of single-cell genomes over- 
comes one of the greatest drawbacks of 
shotgun metagenomic techniques — no- 
tably, that genes and pathways can be 
identified but often cannot be explicitly 
linked one to another, principally because 
of the diversity within microbial com- 
munities and populations. This lack of 
context means that it is very difficult to de- 
termine the overall biological capabilities 
of uncultivated organisms and, by exten- 
sion, to build accurate models of entire 
communities. The sequencing of single 
cells explicitly determines this context, and 
promises to provide a much better view of 
the diversity and capabilities of organisms 
within a community. 

Nonetheless, single-cell genomics is a 
young field in which many technical chal- 
lenges have still to be overcome. In order 
to minimize damage to the DNA, the cell 
lysis must be performed very gently; un- 
fortunately, however, many cells require 
rather harsh techniques to ensure an ef- 
ficient lysis, and therefore this technique 
may not be amenable to some classes of 
microorganisms. As the initial MDA is 
very sensitive to the starting conditions, 
the genome may very often be amplified 
unevenly, perhaps many thousand-fold. 
Although normalization and low-cost
sequencing can partially compensate for
such an uneven coverage, any individually 
amplified genome will typically contain 
between 10% and 90% of the complete 
genome. This problem necessitates the
sequencing of multiple cells, and gener- 
ally reduces the efficiency of the approach. 
Finally, due to the extreme sensitivity of 
MDA, even the smallest levels of con- 
tamination may become problematic, and 
every batch of enzymes and new molecu- 
lar techniques must be carefully tested to 
ensure purity. The development of pro- 
tocols to address these issues, and to 
identify only the best, most evenly ampli- 
fied contaminant-free genomes, is key to 
the future success of single-cell genomics. 
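
A rough calculation illustrates why multiple cells usually have to be sequenced. The sketch below assumes, unrealistically, that each amplified cell recovers an independent and uniformly random fraction of the genome; because MDA bias is partly systematic, real gains from pooling are likely to be smaller. The function name and the model are illustrative assumptions and do not come from any published protocol.

```python
def pooled_coverage(fraction_per_cell: float, n_cells: int) -> float:
    """Expected fraction of a genome covered after pooling n_cells cells.

    Assumption (not from the text): every cell independently recovers a
    uniformly random `fraction_per_cell` of the genome.
    """
    if not 0.0 <= fraction_per_cell <= 1.0:
        raise ValueError("fraction_per_cell must be between 0 and 1")
    if n_cells < 0:
        raise ValueError("n_cells must be non-negative")
    # A position is missed only if every one of the n cells misses it.
    return 1.0 - (1.0 - fraction_per_cell) ** n_cells


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        print(n, round(pooled_coverage(0.5, n), 3))
```

Under this idealized model, two cells that each recover half of the genome would together cover three-quarters of it.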


5 
Sequence Technologies and Tools 


The rapid advancement of sequencing 
technologies during recent years has re- 
sulted in a number of production-ready 
“Next Generation” (NextGen) sequencing 
instruments. These NextGen platforms 
have largely replaced the capillary-based 
Sanger instruments, and have become the 
primary tools for producing sequence data 
for the majority of applications in the 
field of biology and genomics, including 
the study of microbiomes. The NextGen 
instruments allow a large amount of se- 
quence data to be generated in a mas- 
sively parallel manner, and thereby signif- 
icantly increase the experimental output 
to the gigabasepair (Gbp) level per ex- 
periment, and at a significantly lower 
cost compared to Sanger-based technolo- 
gies. Examples of these instruments in- 
clude the Roche/454 GS FLX (Genome 
Sequencer), Illumina/Solexa GA (Genome
Analyzer), and ABI SOLiD (Supported 
Oligonucleotide Ligation and Detection) 
platforms. Rapid and significant techno- 
logical advances continue to be made to 
these platforms in terms of the quantity 
and the quality of the sequence data
generated, and their utility for a wider set
of experimental design and application. 

Furthermore, several other approaches 
are currently in development that promise 
to increase the speed and reduce the 
costs of sequencing even further. As a 
consequence, the generation of genome 
sequence data is no longer proprietary 
to large genome centers, as smaller re- 
search laboratories will have access to 
NextGen sequencing machines. The in- 
terpretation and extraction of useful bi- 
ological data from the large amount of 
NextGen sequence data generated from 
microbiome samples remains a major 
challenge, however, due to the complexity 
of the microbiome samples and the lack 
of standards. Today, approaches and com- 
putational tools for analyzing the NextGen 
microbiome sequence data are still in the 
process of being defined and developed, 
pushing costs away from the data genera- 
tion stages into the data analysis. 


5.1 
454 (Roche) GS Sequencing Technology 


The first of the NextGen platforms 
to become commercially available, in 
2005, was the 454 GS system, a novel 
and highly parallel system based on 
pyrosequencing chemistry [42]. In this 
system, sample preparation involves the 
fragmentation of genomic DNA, followed 
by the ligation of adaptor sequences 
and clonal amplification of the target 
DNA on micron-sized beads, using an 
emulsion-based PCR (emPCR) method 
[43]. The sample preparation process is
much simpler than that of Sanger
sequencing, which contributes not
only to improved cost-effectiveness
but also to a significant improvement in
throughput. The sequencing-by-synthesis
reactions are performed directly on
the template-carrying beads, which are
preloaded into a microfabricated glass 
plate containing 1.6 million reactor 
wells, each of volume 1 pl. In order to 
determine the DNA sequence, the four 
nucleotides are delivered sequentially 
through the plate, while a charge-coupled 
device (CCD) camera detects those wells 
from which light is emitted as nucleotide 
incorporation occurs. 
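
The sequential delivery of nucleotides can be pictured as a "flowgram": one signal value per nucleotide flow, proportional to the number of bases incorporated in that flow. The following is a minimal, idealized sketch with no noise model. The flow order "TACG", the function names, and the use of integer signals are assumptions made for illustration and are not taken from the instrument software.

```python
from itertools import cycle


def flowgram(synthesized: str, flow_order: str = "TACG", n_flows: int = 16) -> list[int]:
    """Idealized signal per flow: how many bases are incorporated.

    `synthesized` is the strand being built, read in the direction of
    synthesis. A homopolymer run is incorporated within a single flow.
    """
    signals = []
    pos = 0
    flows = cycle(flow_order)
    for _ in range(n_flows):
        base = next(flows)
        run = 0
        while pos < len(synthesized) and synthesized[pos] == base:
            run += 1
            pos += 1
        signals.append(run)  # 0 means no light was emitted in this flow
    return signals


def decode(signals: list[int], flow_order: str = "TACG") -> str:
    """Convert an idealized flowgram back into a base sequence."""
    return "".join(base * count for base, count in zip(cycle(flow_order), signals))


if __name__ == "__main__":
    signal = flowgram("TTAGC", n_flows=8)
    print(signal, decode(signal))
```

Because a run of identical bases yields a single, stronger signal rather than several separate ones, misjudging the signal strength translates into an inserted or deleted base.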

Currently, a single 454 GS FLX ma- 
chine run, using the Titanium reagents, 
produces over 400 Mb of sequence data 
with reads of approximately 375 bp in 
length from over 1.2 million distinct 
sequencing-by-synthesis reactions, which 
take about 7 h to complete. Since its 
entry to the marketplace, the 454 pyrose- 
quencing technology has been adapted to 
a wide variety of microbiome applications, 
including shotgun, 16S rRNA gene, and
transcriptomic sequencing, resulting in
over 300 published reports.


5.1.1 Illumina/Solexa GA Sequencing 
Technology 

The Illumina GA IIx and HiSeq 2000 sys- 
tems have the highest throughput, are the 
most widely used NextGen platforms avail- 
able, and employ the massively parallel 
sequencing-by-synthesis approach. The
sample preparation process in this system
involves the fragmentation of genomic 
DNA, followed by the ligation of adaptor 
sequences. DNA template amplification is
achieved by the solid-phase bridge PCR
method, in which adaptor-ligated genome
fragments are amplified locally on the
solid surface of a flow cell to form
clusters [44]. Because the amplification
remains localized, each discrete cluster
consists of a single molecular species; the
cluster is then used for primer-mediated base
incorporation using fluorescently labeled
terminator bases and a modified
polymerase. The use of fluorescent dye
terminator bases results in a marked reduction
of insertion and deletion sequencing er- 
rors as compared to the 454 technology. 
However, the use of a modified
polymerase and nucleotides may result in
base incorporation errors, thus increasing
the number of substitution errors found.
The sequence detection process, via the 
imaging of polymerase-incorporated fluo- 
rescently tagged bases, results in over 600 
billion bases (HiSeq 2000) of sequences 
per instrument run, with read lengths of 
up to 150 bp. Due to their large capacity
and low sequence production costs, the
Illumina GA IIx and HiSeq have become the
most widely used platforms for several
microbiome applications, including shotgun
and transcriptomic sequencing, resulting
in almost 20 published reports.


5.1.2 ABI SOLiD Sequencing Technology

The ABI SOLiD analyzer utilizes a
hybridization-ligation-based approach for
sequence detection, and combines the
advantages of PCR on beads with an open
interface for the detection of a large
number of sequences. Much like the 454
process, single adaptor-ligated genomic 
fragments are amplified by emulsion PCR 
on beads, which are then deposited onto 
a glass slide. The bases are detected by 
successive cycles of primer hybridization 
and detection probe ligation; this is fol- 
lowed by cleavage of part of the detection 
probe to release fluorescence in one of 
four colors that can be used to detect the 
dinucleotide bases incorporated into the 
detection probe. The system has a through- 
put of more than 50 billion bases per run, 
at a read length of 35-75 bp. The ABI 
SOLiD platform has also been adapted 
to a number of microbiome applications 


Microbiomes 


including metagenomic sequencing and 
microRNA expression analysis. 


5.2 
Assembly 


Assembly is a standard bioinformatics 
treatment for shotgun sequence data. 
Whereas the shotgun method [45] enables
high-throughput sequencing by sampling 
DNA in random fashion, the shotgun as- 
sembly method [46] reconstructs genomic 
sequences from the random samples. The 
assembled sequences are longer than the 
individual reads, and they can be more ac- 
curate as they represent the consensus of 
redundant reads. Thus, assembly enriches 
both sequence length and accuracy, and 
these enrichments act to improve the yield 
from any subsequent sequence analysis, 
including gene recognition and pathway 
detection. 
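
The gain in accuracy from redundant reads can be sketched as a column-wise majority vote. The example below assumes the reads have already been aligned and padded to equal length with "-" wherever a read does not cover a position; real assemblers use quality-weighted consensus models, so this illustrates the idea only and is not the method of any particular package.

```python
from collections import Counter


def consensus(aligned_reads: list[str], gap: str = "-") -> str:
    """Majority-vote consensus over equal-length, pre-aligned reads.

    Positions covered by no read are reported as 'N'.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")
    result = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != gap)
        result.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(result)


if __name__ == "__main__":
    reads = ["ACGT--", "ACCTGA", "-CGTGA"]
    print(consensus(reads))
```

In this toy input the single miscalled base in the second read (C instead of G at the third position) is outvoted by the other two reads, and the consensus is longer than any one read.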

The results of an assembly are presented 
as contigs and scaffolds. Contigs are mul- 
tiple sequence alignments of reads, and 
have a consensus sequence, whereas scaf- 
folds (or super-contigs) are collections of 
contigs. Linear scaffolds, which are the 
most common type, define the relative 
order and orientation of contigs, plus 
the gap sizes between contigs. The gap 
sizes are estimated from the expected read 
pair separation distances. Whereas
contigs summarize one or more underlying
reads at every position, scaffolds include 
gaps to summarize any missing or unre- 
solved sequence. 
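
One possible way to represent a linear scaffold in code is as an ordered list of contigs, each with an orientation and an estimated gap to its successor. The class and field names below are hypothetical, and writing gaps as runs of "N" is a common convention that is assumed here rather than taken from the text.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


@dataclass
class PlacedContig:
    sequence: str          # contig consensus sequence
    reverse: bool = False  # True if placed in reverse-complement orientation
    gap_after: int = 0     # estimated gap (bp) to the next contig


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def scaffold_sequence(contigs: list[PlacedContig]) -> str:
    """Concatenate oriented contigs, filling estimated gaps with 'N'."""
    parts = []
    for index, contig in enumerate(contigs):
        parts.append(reverse_complement(contig.sequence) if contig.reverse
                     else contig.sequence)
        if index < len(contigs) - 1:
            # Negative estimates (overlapping contigs) are clamped to zero.
            parts.append("N" * max(contig.gap_after, 0))
    return "".join(parts)


if __name__ == "__main__":
    layout = [PlacedContig("ACGT", gap_after=3), PlacedContig("AAC", reverse=True)]
    print(scaffold_sequence(layout))
```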

In the GOS expedition, water was col- 
lected from the Atlantic and Pacific Oceans 
and, after size filtration for bacterial cells, 
DNA samples were sequenced on Sanger 
sequencing platforms at the JCVI. The data 
were subject to many analyses, includ- 
ing assembly, with one of the assembly 
strategies employing the Celera Assembler 
[47]. The software in this case was
parameterized to be very aggressive, accepting
even nucleotide sequence identity as low 
as 84% as an indication of common ori- 
gin of two reads. In fact, the software 
required modification to operate at such 
low thresholds. With species abundances
so low, there was little hope of recovering
individual genomes; instead, sequences 
were constructed that might represent di- 
verse strains, whole species, or even larger 
clades. 

The assembly strategy generated several 
benefits; notably, it connected many novel 
GOS proteins (which otherwise might 
have been impossible to assign to taxa) 
to 16S or recA taxonomic markers [2]. 
Some genes were linked to others, includ- 
ing a type-II glutamine synthetase; at that 
time, this gene was presumed to have a 
type-I variant in prokaryotes only, and a 
type-II variant in eukaryotes only. How- 
ever, assembly linked the type-II variant 
convincingly to other prokaryotic genes; 
otherwise, the gene might have been re- 
garded as eukaryotic contamination. Fi- 
nally, the assembled contigs were used to 
nucleate gene-specific clusters in a down- 
stream bioinformatics analysis of the reads 
[48]. 

Unfortunately, assembly involves risk; 
even on single-genome data sets assem- 
bly algorithms can omit repeated genome 
segments [49] or combine unrelated se- 
quences that happen to share a repeated 
segment. These are the problems of
nonassembly and mis-assembly, or rather
false-negative assembly and false-positive 
assembly. These problems are exacer- 
bated by the added complexity found in 
metagenomics data sets. Bork and col- 
leagues [50] noted that microbiome com- 
munity complexity induced a low cover- 
age of individual species, and that even 
individual species present polymorphism
between the genomes. Likewise, Pop [51]
noted that variable abundance and
polymorphism in metagenomics data
challenged the traditional assembly software,
while Kunin and colleagues [52] reported
that genomic repeats would lead
assembly software to generate chimeric
sequences from metagenomics data. Rusch
and coworkers found likely chimeras in
the GOS assemblies [2], and later re- 
constructed putative whole genomes that 
assembly had missed. For complex micro- 
biomes in particular, assembly-free anal- 
ysis can reduce the exposure to risk. 
Assembly-free analysis methods include 
clustering reads according to their similar- 
ity with reference sequences [48], fragment 
recruitment to known sequences [53], phy- 
logenetic profiling [54, 55], ab initio gene 
prediction, and gene recognition by ho- 
mology [56]. 

It may be illuminating to challenge 
assembly software with data sets for which 
the underlying genomes are known. As 
simulation rarely captures all of the nu- 
ances of life, these challenges may define 
minimal requirements. MetaSim [57] is 
useful software for simulating reads given 
reference sequences, a relative-abundance 
model of the community structure, and 
error models for particular sequencers. 
An early test of simulated metagenomics 
Sanger reads showed that, for the three 
assemblers tested, only the longest 
contigs were reliable [58]. Recently, the
Human Microbiome Project challenged
various assemblers with real reads
from mock communities that had
been generated by combining
bacteria whose genomes had been
sequenced, including some closely related
genomes. One even-abundance and one
staggered-abundance community were
sequenced using Illumina technology.
When six assemblers were tested,
namely ABySS [59], CABOG [60],
CLC (http://clcbio.com), Newbler
(http://454.com), SOAPdenovo, and
Velvet [61, 62], the packages
performed similarly. Across all assemblies,
most contigs that were 300 bp or 
longer had full-length alignments to a 
reference: 85% in even-abundance and 
80% in staggered-abundance, average 
per assembly. There were between zero 
and 10 chimeras per assembly, defined as
having more than one nonoverlapping
alignment to a reference that was 1000
bp or longer; however, none of the
assemblers even came close to capturing 
one complete genome. The maximum 
contig size per assembly was, on average, 
272 kbp in even-abundance and 148 
kbp in staggered-abundance; hence, it 
would appear that staggering the genome 
abundance levels increased the assembly 
difficulty. It should be noted that all of 
the assembly software tested had been 
developed for single genomes, and not for
metagenomics. 
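
The chimera criterion used in this comparison can be written down as a small check. The sketch below reads the definition literally: a contig is flagged when it carries more than one alignment of at least 1000 bp and those alignments do not overlap on the contig. The record layout and function name are hypothetical, and a production pipeline would need extra rules, for example to avoid flagging a contig whose alignment to a single reference is merely split by an indel.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    contig_start: int  # 0-based, inclusive, in contig coordinates
    contig_end: int    # exclusive
    reference: str     # name of the reference genome that was hit


def is_chimeric(alignments: list[Alignment], min_len: int = 1000) -> bool:
    """True if more than one long, mutually nonoverlapping alignment exists."""
    long_hits = [a for a in alignments if a.contig_end - a.contig_start >= min_len]
    # Greedy interval scheduling: take alignments in order of their end
    # coordinate, keeping each one that starts after the last kept one ends.
    kept = 0
    last_end = None
    for hit in sorted(long_hits, key=lambda a: a.contig_end):
        if last_end is None or hit.contig_start >= last_end:
            kept += 1
            last_end = hit.contig_end
    return kept > 1


if __name__ == "__main__":
    print(is_chimeric([Alignment(0, 1500, "genome_A"),
                       Alignment(1500, 3000, "genome_B")]))
```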

Currently, metagenomics-specific 
assembly software is either lacking or 
immature. Since assembly software 
development is a lengthy process, 
most large-scale microbiome projects 
have re-used existing software that was 
designed for single-genome applications 
only. The MetaHIT Illumina reads 
collected from human stool samples 
were assembled with the SOAPdenovo
software, which had been developed for 
single-genome data [63, 64]. The GOS 
Sanger reads collected from ocean water 
were assembled using a modified version 
of the Celera Assembler software [1, 
2], which had also been developed for 
single-genome data [48]. Microbiome 
assembly should become more sensitive 
and accurate as computer scientists 
develop algorithms and software that are
specific to the task. 

In the near future, microbiome assem- 
bly will be challenged by larger data sets 
as sequencing efficiency continues to im- 
prove. Assembly software developers may 
respond to this situation by incorporating 
more sophisticated distributed processing 
models, or by relying on the increas- 
ing availability of powerful, multi-core, 
high-RAM computer servers. Software 
may also grow to exploit the increas- 
ing relevance of data beyond the shotgun 
reads themselves. This includes reference 
genome sequences, transcript sequence 
data, and single-cell genome sequence. 


5.3 
Data Analysis 


5.3.1 Fragment Recruitment 

Today, whole-genome and metagenome 
studies provide a clearer picture of the 
conservation and diversity found within 
and across microbial species. Complete 
reference genomes are powerful tools for 
exploring the metabolic and functional ca- 
pabilities of an organism, and provide a 
snapshot of the diversity that is present in 
any single isolate. As multiple strains of 
bacteria from the same species have been 
sequenced it has become clear how dif- 
ferent even very closely related organisms 
can be [65]. By measuring the “genomic 
fluidity” of a population, it is increasingly 
possible to estimate the gene content asso- 
ciated with a particular clade of microbes 
[66]. Because metagenomic studies do not 
require cultivation they can reveal the true 
extent of the diversity in a particular popu- 
lation and environment. The technique 
of fragment recruitment is a powerful 
approach for measuring both the abun- 
dance and the diversity of organisms in 
a community relative to known reference 
sequences [25]. Moreover, when applied to
metagenomic assemblies, it may be used 
as a tool for analyzing the accuracy of the 
assembly and to bin scaffolds into genomic 
units. 

The approach to fragment recruitment 
is technically simple: the sequencing reads 
are aligned at low stringency to complete 
and draft genomes using a favorite se- 
quence alignment tool (typically BLASTN). 
A simple filter removes any poorly aligned 
reads, after which the remaining align- 
ments are plotted to show the position 
(on the reference sequence) and the iden- 
tity of the alignment (number of correctly 
aligned bases divided by the length of 
the read). The value of these simple per- 
centage identity plots [67] can be further 
enhanced by coloring individual reads, 
based on various forms of metadata. This 
additional information is typically envi- 
ronmental metadata depicting the sample 
and/or environment characteristics asso- 
ciated with a specific read. Alternatively, 
if mate-paired reads were to be recruited, 
the additional information could be in the 
form of structural metadata describing the 
relative placement and orientation of the 
paired reads. Structural information can 
indicate whether paired reads are closer or 
further away than expected, and indicate 
a possible deletion or insertion relative 
to the reference. Although the fragment 
recruitment concept is straightforward, 
these plots contain a depth of information 
that belies their simple origins. 
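
The filtering and plotting steps described above can be sketched as follows, assuming that the alignments have already been computed (for example with BLASTN) and parsed into simple records. The record fields and the 50% identity cutoff are illustrative assumptions; the text does not specify a threshold.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    read_id: str
    ref_start: int    # position of the alignment on the reference
    matches: int      # number of correctly aligned bases
    read_length: int  # full length of the read


def recruitment_points(hits: list[Hit], min_identity_pct: float = 50.0) -> list[tuple[int, float]]:
    """Return (reference position, percent identity) pairs for plotting.

    Identity is computed as in the text: correctly aligned bases divided
    by the length of the read. Poorly aligned reads are dropped.
    """
    points = []
    for hit in hits:
        identity = 100.0 * hit.matches / hit.read_length
        if identity >= min_identity_pct:
            points.append((hit.ref_start, identity))
    return sorted(points)


if __name__ == "__main__":
    example = [Hit("r1", 100, 90, 100), Hit("r2", 50, 30, 100), Hit("r3", 10, 75, 100)]
    for position, identity in recruitment_points(example):
        print(position, identity)
```

The resulting pairs can be passed to any scatter-plot routine, with the reference position on the horizontal axis and the identity on the vertical axis.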

Recruitment plots can be both an at- 
tractive and elegant means of organiz- 
ing metagenomic information. Generally, 
recruitment provides useful information 
only when the metagenome contains reads 
from the same species or genus as one or 
more of the reference genomes. Given a
suitable reference, the recruited reads will 
be more or less evenly distributed along
the length of the reference, and the density
(number of reads per kilobase) [66-68]
will be indicative of the abundance of that 
genus or species in the environment. In 
general, identity indicates how closely re- 
lated — and therefore how representative 
— a genome is of the environmental or- 
ganisms. Typically, there are gaps in the 
recruitment where tens or hundreds of 
kilobases of the reference will have few, if 
any, recruited reads; these correspond to 
hypervariable segments that appear to be 
strain-specific and contain largely unchar- 
acterized proteins [25, 68]. In some cases 
there will be segments of the genome 
that recruit reads from only a subset of 
samples, thus highlighting any potential 
geographically or environmentally
adaptive genes. The plots can become very
complicated as the abundance of related 
organisms goes up, and as the number 
of related strains in the community in- 
creases; moreover, artifacts may emerge 
due to the recruitment process. Virtually 
every microbial genome will have some 
recruitment, not because they are present 
in a particular metagenomic sample but 
rather because conserved genes and mo- 
tifs will align to even very distantly related 
genes from other organisms. 
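
Both quantities discussed here, the recruitment density and the gaps in recruitment, are simple to compute once the start positions of the recruited reads are known. The sketch below is a minimal illustration; the 10 kb window and the rule "no more than `max_reads` reads per window" are assumptions chosen for the example, not values given in the text.

```python
def reads_per_kb(n_reads: int, ref_length_bp: int) -> float:
    """Recruitment density: recruited reads per kilobase of reference."""
    return n_reads / (ref_length_bp / 1000)


def recruitment_gaps(read_starts: list[int], ref_length: int,
                     window: int = 10_000, max_reads: int = 0) -> list[tuple[int, int]]:
    """Return (start, end) intervals whose windows recruit few or no reads."""
    n_windows = (ref_length + window - 1) // window
    counts = [0] * n_windows
    for start in read_starts:
        if 0 <= start < ref_length:
            counts[start // window] += 1
    gaps = []
    gap_start = None
    for index, count in enumerate(counts):
        if count <= max_reads:
            if gap_start is None:
                gap_start = index  # a sparse stretch begins here
        elif gap_start is not None:
            gaps.append((gap_start * window, index * window))
            gap_start = None
    if gap_start is not None:
        gaps.append((gap_start * window, ref_length))
    return gaps


if __name__ == "__main__":
    starts = [100, 5000, 45000]
    print(reads_per_kb(len(starts), 50_000))
    print(recruitment_gaps(starts, 50_000))
```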

The application of fragment recruitment 
to assemblies derived from a metagenome 
can be used to validate scaffolds or to 
identify chimeric joins that can, in the- 
ory, be broken to improve the assembly. 
Valid scaffolds will have the same general 
distribution and abundance of reads along 
their lengths; furthermore, there should be 
large numbers of “good mate pairs” that
are the correct distance apart and properly 
oriented, distributed along the length of 
the scaffold. The presence of high-identity 
good mates almost ensures that at least one 
clone from the environment will support 
that specific scaffold layout. Unfortunately,
however, the converse is not true: poorly
mated, high-identity reads do not invali- 
date a specific scaffold layout, but rather 
suggest that there may be other valid lay- 
outs (variants) in the environment. 
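
A mate-pair check of this kind might be sketched as below. It assumes a library in which the two reads of a good pair face each other (leftmost read on the "+" strand, rightmost on the "-" strand); this orientation convention, the tolerance, and the category labels are assumptions made for illustration and depend on how a given library was constructed.

```python
def classify_mate_pair(start_a: int, strand_a: str,
                       start_b: int, strand_b: str,
                       expected_sep: int, tolerance: int) -> str:
    """Classify a recruited mate pair by orientation and separation."""
    (left_start, left_strand), (right_start, right_strand) = sorted(
        [(start_a, strand_a), (start_b, strand_b)]
    )
    if (left_strand, right_strand) != ("+", "-"):
        return "misoriented"
    separation = right_start - left_start
    if separation < expected_sep - tolerance:
        return "too_close"  # may suggest an insertion in the sample relative to the scaffold
    if separation > expected_sep + tolerance:
        return "too_far"    # may suggest a deletion in the sample relative to the scaffold
    return "good"


if __name__ == "__main__":
    print(classify_mate_pair(100, "+", 3100, "-", expected_sep=3000, tolerance=500))
```

Counting the "good" pairs along a scaffold gives a simple measure of how well the layout is supported, while the other categories point to possible variant layouts rather than to errors as such.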

The Advanced Recruitment Viewer, 
which is available through the JCVI web- 
site, provides a mechanism to view pre- 
computed recruitment plots for GOS, 
HMP, and other metagenomic studies. It 
also provides a number of basic capabil- 
ities for assessing abundance, examining 
recruitment in the context of existing an- 
notation, viewing structural metadata, and 
exporting recruited reads of interest. 


5.4 
Data Management 


Data generation and management dur- 
ing the sequencing of microbiomes can 
be divided into three distinct stages de- 
pending on the datatypes, the methods 
used for generating each type, and the 
required storage space. The first stage in- 
volves collection of the raw instrument 
data; these are image files containing 
light-intensity values emitted by the flu- 
orescent nucleotides added during each 
sequencing cycle, and captured by the im- 
age sensor of the instrument (for a detailed 
review of sequencing chemistry, see Ref. 
[69]). At the second stage, data are ob- 
tained from the quantification and quality 
check of the intensity values, and then con- 
verted to the corresponding nucleotides 
in order to acquire sequence reads, us- 
ing algorithms specific to each sequencing 
technology. The manufacturer usually im- 
plements these algorithms in software 
packages supplied along with the sequenc- 
ing instrument. The third and final stage is 
of the greatest scientific value, and is where 
meaningful sequence data are extracted 
from the generated data. This involves 
running assembly algorithms [70] with the
sequence reads as input in order to recon- 
struct the original genome, followed by an 
analysis through bioinformatic pipelines 
for gene calling, functional annotation, 
and the discovery of other interesting bio- 
logical features on the genome. 

These stages are interdependent, and 
two opposing trends are observed as se- 
quencing data are processed and moved 
from the first to the third stage. One trend 
is to increase the scientific value of the 
data, while the other is a reduction in size 
and storage requirements. More specifi- 
cally, the raw image files have no value 
for scientific analysis after quantification 
and conversion to sequence reads, except 
if used as reference for a quality check 
of the reads which, nonetheless, would be 
impractical given their data size. On the 
other hand, sequence reads can be used 
as input to assembly algorithms, in order 
to reconstruct the complete genome se- 
quence of an organism; alternatively, they 
may be mapped to annotated genomes for 
polymorphism discovery and the quan- 
tification of gene expression in RNAseq 
experiments. As can be seen from the 
data relating to various sequencing instru- 
ments in Table 1, moving from stage one 
to stage two results in a data reduction of
between 22-fold (Illumina II) and
threefold (SOLiD 3) for both fragment and
paired-end libraries. Finally, assembled
metagenomes have the most value as they 
can be used to discover new full-length 
genes, and can be used for comparative 
and evolutionary analysis with cultivated 
genomes. 
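
The fold reductions quoted above can be checked directly against the fragment-library values in Table 1. The short sketch below does this for the two instruments named in the text; the dictionary layout is an illustrative assumption, and the numbers are the gigabyte values as read from the table.

```python
# Fragment-library storage in gigabytes, as read from Table 1.
stage_one_gb = {"Illumina II": 1100, "SOLiD 3": 1900}  # raw image data
stage_two_gb = {"Illumina II": 50, "SOLiD 3": 600}     # sequence reads


def fold_reduction(before_gb: float, after_gb: float) -> float:
    """How many times smaller the data become between two stages."""
    return before_gb / after_gb


if __name__ == "__main__":
    for instrument in stage_one_gb:
        ratio = fold_reduction(stage_one_gb[instrument], stage_two_gb[instrument])
        print(f"{instrument}: {ratio:.1f}-fold")
```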

Tab. 1 Storage requirements at the three stages of data analysis, for the
different sequencing technologies. Data units: gigabytes (GB).

                                     Roche/454         Illumina/Solexa         ABI/SOLiD
                                  GS20    FLX     Ti      I     II    IIx      1      2      3
Stage one images    Fragment        10     10     30    500   1100   2800   1800   2500   1900
                    Paired-end       2     10     30   1000   2200   5600   3600   5000   3800
Stage two sequence  Fragment       0.5             4     30     50    (b)    100    140    600
                    Paired-end     (a)      1      4     60    100    (b)    200    280   1200

(a) This instrument cannot generate paired-end libraries.
(b) Data are not available.

The first row of numbers at each stage refers to fragment data, while the second row is for
paired-end library data [71].

Source: Genome Center at Washington University, Dooling 2010,
http://www.politigenomics.com/next-generation-sequencing-informatics.

Since the introduction of the ABI SOLiD
instrument in 2007, sequencing technologies
have continued to move in a direction
where throughput is increasing while the
cost per sequenced DNA base is decreasing;
in fact, the costs have halved
about every five months. Furthermore,
from the information provided in Table 1 it 
is apparent that data production on the or- 
der of terabytes per run (1 TB = 1000 GB) 
is becoming the norm for next-generation 
sequencing; hence, it would be possible 
to overwhelm the storage capacity of most 
laboratories with only a few sequencing 
runs. Consequently, significant costs for 
data storage must be considered in addi- 
tion to the expense of acquiring the se- 
quencing instrument. Recently, prices per 
gigabyte have been following a downward 
trend, similar to the cost per sequenced 
DNA base, but this has occurred at a re- 
duced rate, being halved about every 14 
months [72]. Despite the price drop in 
sequencing, which is expected to lead to 
the “US$ 1000 personal genome,” the in- 
vestment required for storage systems and 
the overall costs for bioinformatic infras- 
tructure may increase this figure by many 
orders of magnitude. 
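
The consequence of the two different halving times can be made concrete with a simple exponential model. The sketch below takes the five-month and 14-month figures from the text and assumes that both trends continue smoothly, which is an idealization; the function names are hypothetical.

```python
def remaining_cost_fraction(months: float, halving_months: float) -> float:
    """Fraction of the original unit cost left after `months`."""
    return 0.5 ** (months / halving_months)


def storage_share_growth(months: float,
                         sequencing_halving: float = 5.0,
                         storage_halving: float = 14.0) -> float:
    """Factor by which storage cost grows relative to sequencing cost.

    Both costs fall, but sequencing falls faster, so the cost of storing
    a base rises relative to the cost of producing it.
    """
    return (remaining_cost_fraction(months, storage_halving)
            / remaining_cost_fraction(months, sequencing_halving))


if __name__ == "__main__":
    for months in (12, 24, 36):
        print(months, round(storage_share_growth(months), 1))
```

Under these assumptions, after two years the cost of storing a sequenced base would have grown roughly eightfold relative to the cost of generating it.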

The storage price per gigabyte also
depends on the technology used. At one end
of the spectrum are consumer-grade external
hard drives that cost approximately
US$ 150 per TB (first quarter of year 
2011), whereas at the opposite end are 
highly scalable clustered storage systems 
with automatic replication, archiving, and 
disaster recovery, but costing hundreds of 
thousands of dollars. Many intermediate 
cost tiers exist between these two extremes, 
however, with systems having different 
combinations of capacity, performance, 
and data reliability characteristics. A key 
guiding principle when designing stor- 
age infrastructures is the scientific value 
placed on each of the three data stages of 
next-generation sequencing, which in turn 
determines the investment amounts in the 
different storage tiers. More specifically, 
an ideal configuration would be composed 
of large amounts of cheap storage attached 
to the sequencing instrument; these would 
hold the bulk of the stage one image data 
on a temporary basis, until the data are 
processed and quantified into sequence 
reads. The second component of the
system would hold the stage two sequence
data; this would consist of high-speed disks
because the performance is critical and 
this is where most of the computation, 
including assembly, mapping, and anno- 
tation, would take place. The second com- 
ponent must also be made disaster-proof 
by using replicated disk systems, with the 
sequence data and analysis results (stage 
three data) safeguarded against hardware 
failures, as these will be used for scientific 
publications and, in most cases, released 
to the public following completion of the 
project. A third and final component is 
that of archival storage, which is used for 
the long-term preservation of almost all 
data generated from sequencing projects. 
This entails two key characteristics: first, 
to provide large volumes of storage at 
very low cost; and second, to include a 
disaster-proof status. The first
characteristic is achieved by using disk drives or
even magnetic tape that are not highly
networked, in order to achieve a low cost.
The second characteristic can be achieved 
through the off-site replication of data, 
or by using the services of commercial 
vendors who offer the preservation of mag- 
netic tapes in disaster-proof vaults. An- 
other option for archival storage has, until 
recently, been NCBI’s Short Read Archive 
(SRA, http://www.ncbi.nlm.nih.gov/sra), 
which is offered as a free service to re- 
search groups. 

Given the significant costs that can re- 
sult from storage systems, data manage- 
ment plans should be drawn up at the early 
stages of the experiment design and the 
submission of proposals involving the cre- 
ation of large amounts of next-generation 
sequencing data. The length of time re- 
quired for completion, and the extent of 
the data analysis to be performed as part 
of the funded project, represent major 
factors in setting a data retention pol- 
icy and planning for storage costs. This 
can also help to determine whether, and
for how long, the images, sequence reads 
and sequence assemblies should be re- 
tained for the particular goals of a specific 
sequencing project. As a rough guiding 
principle, stage one raw image data may 
be kept for up to two weeks until the qual- 
ity checks, base calling, and conversion 
to sequence reads are complete. In con- 
trast, stage two sequence reads, and their 
stage three derivatives (assembly, annotation,
etc.), must be retained for anywhere from
one year (until the results are published) to
as long as a decade if the research institute
acts as a resource center for the
scientific community. For example, the 
original plan for the 1000 Genomes Project 
(http://www.1000genomes.org) was for 
research groups submitting sequenced 
genomes to include all generated files. 
The average storage requirement for both 
the images and sequence reads was es- 
timated at about 50 bytes per sequenced 
base. However, later in the project the 
participating members decided to submit 
only reads along with some metadata re- 
garding their quality, which reduced the 
requirement to a single byte per base. 
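
The effect of this decision on storage can be estimated with simple arithmetic. The sketch below assumes, purely for illustration, a 3-gigabase genome sequenced at 30-fold coverage (neither figure is taken from the project itself) and uses decimal units (1 TB = 10^12 bytes); `storage_bytes` is a hypothetical helper name.

```python
def storage_bytes(num_bases: int, bytes_per_base: float) -> float:
    """Storage needed for a given number of sequenced bases."""
    return num_bases * bytes_per_base

# Illustrative assumptions, not figures from the 1000 Genomes Project.
GENOME_SIZE_BASES = 3_000_000_000
COVERAGE = 30
sequenced_bases = GENOME_SIZE_BASES * COVERAGE

with_images = storage_bytes(sequenced_bases, 50)  # images plus sequence reads
reads_only = storage_bytes(sequenced_bases, 1)    # reads plus quality metadata

print(f"images + reads: {with_images / 1e12:.1f} TB")
print(f"reads only:     {reads_only / 1e9:.0f} GB")
```

Under these assumptions, the requirement per genome falls from about 4.5 TB to about 90 GB.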

An alternative to investing in
informatics infrastructure would be to
obtain storage capacity from a “cloud
computing” service. Cloud services provide
research groups with the ability to store 
data and to perform bioinformatic anal- 
ysis on an essentially unlimited pool of 
Virtual Machine (VM) servers, without the 
groups owning or maintaining any com- 
puter hardware. The charge model used 
by cloud service providers is similar to that 
used by utilities such as electricity, and cus- 
tomers are billed based on the amounts of 
computational resources consumed. This 
system may operate better for smaller re- 
search laboratories, as it avoids the need to 
invest in computer hardware and data center
infrastructure, the cost of which cannot
be justified for just a few experiments. For 
example, at the Amazon Simple Storage 
Service (S3, http://aws.amazon.com/s3),
prices start at US$ 0.14 per
stored GB per month (as of the first quarter
of 2011). This price applies to the first
1,000 GB of storage used, after which there 
is a tiered charge that allows the cost to 
be reduced as the usage increases. For 
example, an additional 49,000 GB would 
cost US$ 0.125 per GB per month. 
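
The tiered charge can be illustrated with a short calculation. The sketch below uses only the two price tiers quoted above (first quarter of 2011); the function name `monthly_cost_usd` is hypothetical, and usage beyond 50,000 GB is rejected because the corresponding prices are not given here.

```python
# Price tiers quoted in the text: (tier size in GB, US$ per GB per month).
# Tiers beyond the first 50,000 GB are not given in the text and are not modeled.
PRICE_TIERS = [(1_000, 0.14), (49_000, 0.125)]

def monthly_cost_usd(stored_gb: float) -> float:
    """Monthly storage charge for up to 50,000 GB under the quoted tiers."""
    remaining = stored_gb
    cost = 0.0
    for tier_size_gb, price_per_gb in PRICE_TIERS:
        billed_gb = min(remaining, tier_size_gb)  # portion billed at this tier's price
        cost += billed_gb * price_per_gb
        remaining -= billed_gb
    if remaining > 0:
        raise ValueError("usage exceeds the tiers quoted in the text")
    return cost

print(f"5,000 GB: US$ {monthly_cost_usd(5_000):.2f} per month")
```

For 5,000 GB, the first 1,000 GB are billed at US$ 0.14 and the remaining 4,000 GB at US$ 0.125, giving about US$ 640 per month.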

An important concern when exchang- 
ing sequence data among collaborating 
research centers, using either cloud-based 
storage or downloading data from public 
databases, is the data transfer bottleneck 
that occurs across the Internet. According 
to results reported by the Amazon cloud 
service, when using an average Internet 
connection 600 GB of data would take ap- 
proximately one week for transfer, while 
using a high-speed line would require 
about two days. Another option offered
by the Amazon service is for users to
physically ship disk drives to the
company’s local branch
(http://aws.amazon.com/importexport), and have the data
copied directly to the cloud servers; this 
would cost US$ 80 for disk drives con- 
taining up to 4 TB of data. Despite the 
data transfer bottleneck, data storage us- 
ing a cloud service has the advantage that 
large-scale sequencing datasets can be eas- 
ily exchanged among collaborators on a 
worldwide basis. Inherent in the design of
the Amazon S3 service is a replication of data
across several physical storage locations 
for disaster prevention; this is currently 
available on US East and West regions, 
European Union (Ireland), and Asia Pa- 
cific (Singapore). For the data upload, the 
research team can choose the closest region
to minimize data transfer latency
over the Internet. The cloud service can
then automatically replicate the data to 
different locations as part of the disaster 
prevention policy, thus allowing collabo- 
rating research groups to download the 
data from their closest region. 
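
The transfer times quoted above imply a sustained throughput that can be worked out directly. The following sketch assumes decimal units (1 GB = 10^9 bytes) and ignores protocol overhead; both function names are hypothetical.

```python
def implied_mbit_per_s(gigabytes: float, days: float) -> float:
    """Sustained throughput needed to move the data in the given time."""
    bits = gigabytes * 1e9 * 8
    seconds = days * 24 * 3600
    return bits / seconds / 1e6

def transfer_days(gigabytes: float, mbit_per_s: float) -> float:
    """Time needed to move the data at a given sustained throughput."""
    bits = gigabytes * 1e9 * 8
    return bits / (mbit_per_s * 1e6) / (24 * 3600)

print(f"600 GB in 7 days: {implied_mbit_per_s(600, 7):.1f} Mbit/s")
print(f"600 GB in 2 days: {implied_mbit_per_s(600, 2):.1f} Mbit/s")
```

One week for 600 GB corresponds to roughly 8 Mbit/s sustained, and two days to roughly 28 Mbit/s.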


6 
Future Perspectives 


Currently, an evolving knowledge of the 
microbiome continues to provide a greater 
understanding of the vital role of baseline 
commensal microbes in the various de- 
fined environments of the human body. 
The focus of current microbiome stud- 
ies is to assess the totality of microbes 
and their genomes to determine the role 
of host-microbe and microbe-microbe
interactions. Consequently, it will be pos- 
sible to expand the present knowledge of 
the microbiome’s influences in shaping 
the host ecosystem and immunity, while 
evaluating their importance in the occur- 
rence of disease and the maintenance of 
a healthy homeostasis. This knowledge 
could, potentially, have far-reaching impli- 
cations in the diagnosis of disease risk and 
disease pathogenesis, and also in the de- 
sign of microbiome-focused therapeutics 
and “personalized medicine.” 
Keywords


Boolean network 
A set of nodes which have two possible states (1 or ON and 0 or OFF) and a list of
Boolean functions, each of which is assigned to a node.


Biologically relevant Boolean functions 

A Boolean function specifies, at each time step, the activity of the node it is assigned to, 
given the activity at the previous moment of some other nodes linked to it as inputs. 
Among all the theoretically possible Boolean functions, there are some that can be 
biologically implemented, while others are biologically meaningless. 


Network inference 
Boolean functions can be inferred from experimental expression data. 


Probabilistic Boolean network (PBN) 

PBN models represent a tool to take into account the uncertainty and stochasticity of 
real regulatory networks. Instead of assigning just one Boolean function to each node, 
it is possible to assign a set of functions together with a probability for each function to 
contribute. The dynamics of such a network can be described as a Markov chain.


Rate equation 

An ordinary differential equation that, in the context of continuous modeling of 
molecular regulatory interactions, defines the temporal rate of production of a certain 
regulated substrate as a function of the concentrations of the ensemble of regulators 
affecting it. 


AND/OR logical functions 

In Boolean terms, AND and OR are the simplest logical functions of two Boolean input
variables. In an AND logic, the output is true only when both inputs are true; in an
OR logic, it is sufficient that at least one of the inputs is true for the output to be
true.


Network motif 
A small biomolecular circuit consisting of a low number of substances (typically 2-4) that
interact with one another according to a certain topological scheme of regulations
(typically 2-6).
During the past few years, whilst the capacity to produce large amounts of
experimental data has increased dramatically, the ability to analyze these data has not 
progressed at the same pace. Thus, there is a risk that this sea of acquired data will 
become overwhelming, and that no meaningful theoretical or phenomenological 
insights will be extracted from it. One way to overcome such a problem would be to 
study the interactions among a system’s constituents as a network, involving both 
the structure and dynamics of that system. In this chapter, the main approaches 
used to model the dynamical behavior of biomolecular, regulatory gene networks 
are reviewed, whereby two different approaches are discussed and analyzed: (i) the
discrete case, as represented by Boolean functions; and (ii) the continuous case, 
which is based on differential equations. For both scenarios, the most recent results 
are outlined, and details provided of the many variants that can be adopted. Finally,
the future lines of research in this field are proposed.


1 
Introduction 


This chapter deals with networks at the 
level of the cell, which is the basic unit 
of any living organism. In the cell, all 
types of structural and functional pro- 
cesses are ruled by the intricate interaction 
of genes, proteins, and other molecules. 
At this point, due to limitations of space, 
the topological properties of molecular 
networks, or the most common metrics 
used to characterize the system, will not 
be discussed. Rather, the reader is re- 
ferred to specific reports of networks (e.g., 
Refs [1, 2]), the dynamics of which will be 
focused on here. 

Regulatory mechanisms among genes 
can be translated into mathematical lan- 
guage in various ways. The architecture of 
the cell implies a set of dynamical con- 
nections among its components that must 
be translated into a set of mathematical 
equations that captures the temporal and 
spatial evolution of the system. The appro- 
priate choice of the dynamical equations 
will depend on the level of description 
required. In this sense, large-scale gene
regulatory networks are usually described
by means of Boolean functions. On
the other hand, when the aim is to
describe simple regulatory mechanisms 
that involve very few genes, more detailed 
models — such as nonlinear differential 
equations — are best suited. Thus, the use 
of each type of description will depend on 
the sum of complexities with regards to 
the structure and dynamics of the system. 

The development of quantitative models 
to capture the coordinated behavior of the 
circuitry of interacting molecules allows, 
ultimately, the physiological properties of 
the cell to be determined. For instance, 
the use of ordinary, partial, and stochas- 
tic differential equations has allowed the 
tracking and prediction of how quickly 
each component of a biochemical network 
changes with time. The extrapolation of 
differential equations characterizing the 
dynamics of small subsystems (such as 
gene circuits) to include an increasing de- 
gree of complexity may render the model 
prohibitively complicated. In such situa- 
tions, or when many rate constants are 
unknown, it may become necessary to 
resort to dynamical systems theory as a 
reliable approach to the solution of the 
differential equations. Steady states, limit 
cycles, and other dynamical configura- 
tions can provide quantitative information 
regarding changes in the magnitude and
direction of the state variables. This type of 
analysis can be translated into the physio- 
logical states of the cell (e.g., stable steady 
states are associated with checkpoints). 

Finally, the main drawback of the
differential equation-based approach is the
need to know the (very rarely available)
kinetic details of the molecular and
cellular interactions. There is increasing
evidence, however, that the input-output 
curves of many regulatory relationships 
are strongly sigmoidal, and can be ap- 
proximated by step functions. In addition 
to this, regulatory networks often main- 
tain their function even when faced with 
fluctuations in components and reaction 
rates. This allows the implementation of 
coarse-grained methods, such as Boolean 
models, that are also widely used in sys- 
tems biology research. 
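
The step-function approximation can be made concrete with a small numerical sketch. It assumes a Hill-type function as the sigmoidal input-output curve; the Hill form, the parameter names, and the sample values are illustrative choices rather than anything prescribed above.

```python
def hill_activation(x: float, threshold: float, n: float) -> float:
    """Sigmoidal response: 0 for small x, 1 for large x, 0.5 at x == threshold."""
    return x ** n / (threshold ** n + x ** n)

def step_activation(x: float, threshold: float) -> int:
    """Boolean idealization of the same response."""
    return 1 if x > threshold else 0

# As the exponent n grows, the sigmoidal curve approaches the step function.
for n in (1, 4, 20):
    values = [round(hill_activation(x, 1.0, n), 3) for x in (0.5, 0.9, 1.1, 2.0)]
    print(f"n = {n:2d}: {values}")
print("step:  ", [step_activation(x, 1.0) for x in (0.5, 0.9, 1.1, 2.0)])
```

The steeper the curve, the less information is lost by recording only whether the regulator is below or above its threshold, which is the idealization that Boolean models rely on.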

Both approaches — namely, continuous 
and discrete modeling — are discussed in 
the following sections. More precisely, 
within the family of discrete models, at- 
tention will be focused on the simplest 
case of Boolean approaches while, with re- 
gards to continuous methods, approaches 
based on ordinary differential equations 
(ODEs) will be reviewed. Whilst these are 
only two possibilities of a wider family of 
models, they may help to provide an under- 
standing of the main features of regulatory 
systems dynamics. It should be noted that 
this review is by no means exhaustive, as 
limitations of space have left no option but 
to make a choice. 


2 
Boolean Dynamics Models 


Boolean networks have had a long
history in modeling biological
systems since the pioneering studies of
Kauffman and Thomas during the early
1970s [3, 4]. These early investigations 
focused on the generic properties of net- 
works and their connections with the be- 
havior of living systems, and the evolution 
of biological networks. The “post-genomic
revolution” (which has been based on
new experimental techniques and the
consequent availability of large amounts of
data) has led to a resurgence in dynamical
Boolean network modeling, as applied
to the study of the dynamics of real
biological networks [5]. Despite the high level
of abstraction and simplification of these
models, they are able to provide answers
to a range of questions, from more
theoretical questions concerning the
origin of life and its evolution [6] to
much more practical questions related
to the actual behavior of real organisms
and drug target identification [7]. As will
become evident later, there are two main
approaches to extracting knowledge from a
Boolean network model of regulatory net- 
works. The first approach is to construct 
random Boolean networks to study and in- 
terpret their generic properties and relate 
them to known biological features. A sec- 
ond approach involves directly inferring 
particular structures of biological networks 
from experimental expression data. In this 
way, it is possible to reveal more detailed 
information on the specific organisms un- 
der study, and hence to design therapeutic 
interventions [8]. 

The modeling of gene regulatory net- 
works via Boolean networks must take into 
account the fact that cellular processes oc- 
cur within a very noisy environment, and 
use very unreliable elements. The key is 
then to include an increasing complex- 
ity in the models, and later to question 
its robustness. The possibility of Boolean 
network modeling relies on the sigmoidal 
nature of the regulatory interactions, so
that the step-function approximations are
natural: the gene product is either absent 
(below a given threshold) or present, and 
the gene is either off or on, leading to a 
“logical” description of the interactions.
This simplification is the elementary ver- 
sion of a class of idealization that relies on 
discrete approximations of the nonlinear 
regulatory interactions. Other approximations
are also possible and are actually used,
for example piecewise-linear approximations [9].
However, the Boolean network model is 
the best formally studied model system, 
and the model for which a wide range 
of reverse engineering algorithms exists. 
Further details of this form of modeling 
approach are provided in the next sections. 


2.1 
Boolean Formalisms 


A Boolean network $G(V, F)$ is defined by a
set of nodes $V = \{x_1, \ldots, x_n\}$ which have
two possible states (1 or ON, and 0 or
OFF) and a list of Boolean functions
$F = (f_1, \ldots, f_n)$. A Boolean function $f_i$
specifies the activity of the node $x_i$ at each
time step, taking into account the activity at
the previous moment of some other nodes
linked to it (the inputs). In a biological
context, the value of $x_i$ represents the state
of the expression of gene i (1 when it 
is expressed and 0 otherwise), while the 
list of Boolean functions corresponds to 
the set of regulatory interactions among 
genes. The dynamics of a Boolean network 
can be updated either synchronously or 
asynchronously, depending on whether 
all the nodes are updated at the same 
time or not, respectively. As will be
seen later, the description of attractors
depends markedly on the updating choice
(at the moment, attention is focused on 
synchronous updating). 
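
A synchronous update can be sketched in a few lines. The three-node network below is hypothetical and chosen only for illustration; the names `FUNCTIONS` and `synchronous_step` are likewise assumptions of this sketch.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[int, ...]

# Hypothetical regulatory rules: one Boolean function per node.
# Each function reads the state at time t and returns the node's value at t + 1.
FUNCTIONS: Sequence[Callable[[State], int]] = (
    lambda s: 1 - s[2],      # x1(t+1) = NOT x3(t)
    lambda s: s[0] & s[2],   # x2(t+1) = x1(t) AND x3(t)
    lambda s: s[0] | s[1],   # x3(t+1) = x1(t) OR x2(t)
)

def synchronous_step(state: State, functions: Sequence[Callable[[State], int]]) -> State:
    """Update every node at once, each from the same previous state."""
    return tuple(f(state) for f in functions)

state: State = (1, 0, 0)
for t in range(6):
    print(t, state)
    state = synchronous_step(state, FUNCTIONS)
```

Because every function is evaluated on the same previous state, the successor of each state is unique. An asynchronous scheme would instead update one node (or a subset of nodes) at a time, so that a state may have several possible successors.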
The dynamics can be represented in the
form of a directed graph, so that each 
node represents a state of the network, 
while a link between two nodes repre- 
sents the possible transition from one to 
another, according to the list of Boolean 
functions. As the dynamics is determin- 
istic (recall that synchronous updating is 
being assumed), each node can have only 
one output link, and consequently a path 
in this graph would be analogous to a tra- 
jectory in continuous models. In addition, 
due to the deterministic nature of Boolean 
networks and the finite number of states, 
there are states which are repeatedly visited.
Such states, which are referred to as
“attractors,” can be fixed points or
cycles, while the transient states that lead 
to an attractor are termed its basin of at- 
traction. Hence, a complete description of 
the dynamical properties of a Boolean net- 
work implies knowledge of the attractors 
and their basins of attraction — that is, of 
the state space. An important parameter, 
as will be seen later, is the cycle length; 
this is the number of different states that 
will be visited before returning to the orig- 
inal state in an attractor. Finally, it should 
be noted that Boolean network models are 
examples of finite dynamical systems, and 
can be generalized to finite fields [10]. 
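
For small networks, the attractors and the sizes of their basins can be found by exhaustive enumeration of the state space. The sketch below does this for a hypothetical three-node network under synchronous updating; the rules and function names are illustrative, and the basin size reported here counts the attractor's own states together with the transient states that lead to it.

```python
from itertools import product

def step(state):
    """Hypothetical synchronous update rule for a three-node network."""
    x1, x2, x3 = state
    return (1 - x3, x1 & x3, x1 | x2)

def find_attractors(step, n):
    """Map each attractor (a tuple of states in cycle order) to its basin size."""
    attractor_of = {}  # state -> attractor it eventually reaches
    basin_size = {}    # attractor -> number of states flowing into it
    for start in product((0, 1), repeat=n):
        path, position = [], {}
        state = start
        # Follow the trajectory until it meets itself or an already classified state.
        while state not in position and state not in attractor_of:
            position[state] = len(path)
            path.append(state)
            state = step(state)
        if state in attractor_of:
            attractor = attractor_of[state]
        else:
            cycle = path[position[state]:]
            k = cycle.index(min(cycle))  # rotate so the cycle has a canonical start
            attractor = tuple(cycle[k:] + cycle[:k])
            basin_size[attractor] = 0
        for visited in path:
            attractor_of[visited] = attractor
        basin_size[attractor] += len(path)
    return basin_size

for attractor, size in find_attractors(step, 3).items():
    print(f"cycle length {len(attractor)}, basin size {size}: {attractor}")
```

For this toy network, all eight states flow into a single cycle of length five. The enumeration visits all 2^n states, so it is practical only for small n.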


2.2

Generic Properties of (Random) Boolean 
Networks and Cell Behaviors: Cell 
Differentiations and the Cell Cycle 


The intuition of Kauffman, in his pioneer- 
ing studies of gene regulatory networks, 
was to relate the dynamical properties 
of (random) Boolean networks to cellu- 
lar types [11]. In other words, the goal of 
this approach was to interpret different cel- 
lular types in terms of different attractors 
of the regulatory network. Another
interpretation of the attractors of a Boolean
network is that they may represent different
functional states of a cell (differentiation, 
growth, quiescence, and apoptosis) [12]. 
Taking these two interpretations together
allows the developmental processes to be
thought of in terms of an exploration of the
attractors in the state space. Accordingly, it
is possible to mimic the developmental
process of an organism by simulating
the dynamics of a Boolean network.

The so-called “NK-model” of Kauffman
is based on a random Boolean network
with N fixed nodes and K fixed inputs
per node. The random elements here are the
wiring of the network, and the Boolean 
rules that are assigned to each node. Ini- 
tially, Kauffman began to study such net- 
works with the aim of gaining insights into 
the mechanisms of regulatory networks, 
and identified very interesting behaviors 
beyond a mere biological context. As K 
decreases from N to 1, there is a phase 
transition from chaos to order. Specifically,
for K > 2 the dynamics is chaotic, whereas
for K < 2 it is ordered. Complex dynamics
emerges in the transition region.
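
A random network of this kind is straightforward to generate and simulate. The following sketch is a minimal illustration and makes several assumptions that are not fixed by the description above: the K inputs of each node are drawn without repetition, every truth-table entry is 0 or 1 with equal probability, updating is synchronous, and the function names are hypothetical.

```python
import random

def random_nk_network(n, k, rng):
    """Random wiring (k distinct inputs per node) and random Boolean rules."""
    inputs = [rng.sample(range(n), k) for _ in range(n)]
    tables = [[rng.randint(0, 1) for _ in range(2 ** k)] for _ in range(n)]
    return inputs, tables

def step(state, inputs, tables):
    """Synchronous update: each node looks up its rule on its inputs' values."""
    new_state = []
    for node_inputs, table in zip(inputs, tables):
        index = 0
        for j in node_inputs:
            index = (index << 1) | state[j]  # encode the input values as a table index
        new_state.append(table[index])
    return tuple(new_state)

def cycle_length(state, inputs, tables):
    """Length of the attractor eventually reached from the given state."""
    first_seen = {}
    t = 0
    while state not in first_seen:
        first_seen[state] = t
        state = step(state, inputs, tables)
        t += 1
    return t - first_seen[state]

rng = random.Random(0)  # fixed seed, so that runs are repeatable
n = 12
for k in (1, 2, 5):
    inputs, tables = random_nk_network(n, k, rng)
    start = tuple(rng.randint(0, 1) for _ in range(n))
    print(f"K = {k}: cycle length {cycle_length(start, inputs, tables)}")
```

A single small network says little about the ordered and chaotic regimes; the scaling behavior only becomes visible when many random networks are averaged for several values of N.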

In order to characterize the chaotic and 
the ordered phases, two properties are 
scrutinized: (i) the expected length of the 
cycles; and (ii) the sensitivity to the initial 
conditions. In the chaotic regime, the 
cycles quickly become very long ($\sim 2^{N/2}$),
so that they make no biological sense and
the system then requires too much time 
to explore its attractors (the number of 
which scales as N). In contrast, in the 
ordered regime the cycle lengths and 
the number of attractors both scale as $N^{1/2}$,
such that the network has a relatively 
large number of attractors that can be 
explored within a relatively short time. 
This led to the interpretation by Kauffman
of the attractors in terms of cell differentiation. In
the case of the second descriptor, within
the chaotic regime a small perturbation 
produces a cascade that propagates along 
the network, affecting the dynamics in 
random fashion. In the ordered regime,
however, a perturbation rarely causes
the network to leave the attractor in
which it has settled (when this does
happen, the network falls into a
nearby attractor). Thus, a type of stability is
achieved that also allows for the possibility
of exploring nearby attractors (this is the
second interpretation).

The actual mechanism at the root of the 
chaos-order transition is the formation
of percolating frozen cores of elements 
fixed in one of the two possible states, 
1 or 0. In the chaotic regime, there are 
islands of frozen cores in a connected 
sea of oscillating elements, which explains 
the chaotic dynamics. In contrast, in the 
ordered regime the frozen cores have 
percolated while the functional islands 
of oscillating elements have remained, 
but cannot influence each other; this 
gives rise to the observed stability. At the 
transition region, the frozen cores begin 
to percolate, allowing the emergence of 
complex dynamics. 


2.3 
Topological and Dynamical Properties: 
Homeostasis, Flexibility, and Evolvability 


Structural stability is a central concept in 
dynamical systems, and has its biological 
counterpart in the concept of homeostasis,
that is, the need of an organism not to be
destroyed by “small” changes in the
environment it is in contact with. In other
words, homeostasis represents the capac- 
ity of the organism to self-sustain its 
metabolism and developmental processes. 
Whereas homeostasis accounts for the
capacity of a living system to sustain itself,
flexibility captures whether organisms are
able to develop different functional states 
through epigenetic processes in a vary- 
ing environment. This latter property may 
be related to multi-stationarity [13] — that 
is, the presence of different stable attrac- 
tors among which the system can choose, 
meaning that the organism would have 
a wide range of responses to changes in 
environmental conditions. 

As hypothesized by Thomas [14], and 
successively demonstrated [15, 16], the 
above-described properties can be placed 
in strict relation to the topological proper- 
ties of the underlying regulatory network, 
where the key role is played by feedback 
(FB) loops. In the words of Thomas, in 
a FB loop each element exerts an influ- 
ence on the evolution of all elements in 
the loop, including itself. In a “positive
loop” the elements are positively influenced,
whereas in a “negative loop” the
elements are negatively influenced. If a
sign is assigned to these regulatory
interactions (a plus sign (+) to a positive
regulation and a minus sign (-) to a
negative regulation), then the sign of
a loop will be determined by the parity
of the number of negative interactions: an
even number gives a positive loop, and an
odd number a negative loop. To maintain
homeostasis, the presence of at least one 
negative loop is required, whereas to en- 
sure multi-stationarity at least one positive 
loop is required. The power of these
statements lies in their generality, as they
have been demonstrated in many different
contexts, from differential models to Boolean
models. Consequently, in order to discover 
genes involved in switching differentiation 
processes it is necessary to examine genes 
involved in positive loops. In addition, the 
different dynamical behaviors associated 
with the two different topologies (posi- 
tive and negative loops) provide a rational 
means of decomposing any complex reg- 
ulatory networks. In other words, loops 
can be used as a set of building blocks to
analyze a complex network, or to construct 
one with given properties. 
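
The parity rule for the sign of a feedback loop amounts to multiplying the signs of its interactions. The sketch below illustrates this; the two example circuits (a two-gene mutual repression and a ring of three repressors) are standard textbook cases added here for illustration, and `loop_sign` is a hypothetical helper name.

```python
from math import prod
from typing import Sequence

def loop_sign(interaction_signs: Sequence[int]) -> int:
    """Sign of a feedback loop: +1 for an even number of negative links, -1 for odd."""
    if any(sign not in (+1, -1) for sign in interaction_signs):
        raise ValueError("each interaction must be +1 (activation) or -1 (inhibition)")
    return prod(interaction_signs)

# Illustrative circuits (not taken from the text):
mutual_repression = [-1, -1]       # A inhibits B, B inhibits A
three_repressor_ring = [-1, -1, -1]  # A inhibits B, B inhibits C, C inhibits A

print("mutual repression:   ", loop_sign(mutual_repression))
print("three-repressor ring:", loop_sign(three_repressor_ring))
```

Two inhibitions compose into a positive loop, the topology required for multi-stationarity, whereas three inhibitions give a negative loop, the topology associated with homeostasis.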

On examining the transcriptional reg- 
ulatory (TR) network of Escherichia coli 
metabolism, other topological properties 
can be found that might account for home- 
ostasis, flexibility, and evolvability. In fact, 
the network can be coupled to the environ- 
ment simply by adding to the network any 
nodes that represent the external metabo- 
lites [17]. The resultant network is an 
acyclic directed graph [18], organized in 
a hierarchical manner, with the external 
metabolites being the root and metabolic 
genes (as they have no outgoing links) 
being the leaves. Between these layers 
are located the transcription factors (TFs). 
On analyzing the dynamics, it becomes 
clear that there are only fixed-point
attractors; that the basin of attraction of each
attractor is the entire state space; and
that the configuration of the attractors depends
upon the environmental conditions [17].
On the other hand, the hierarchical acyclic 
structure accounts for homeostasis and the 
control is located in the root [19]. There- 
fore, the configuration of the root forces 
the configuration of the TFs which, in turn, 
determines the configuration of the leaves. 
Moreover, due to the acyclic structure, the 
fixed point is stable. 

If the leaves are removed from the net- 
work, then the resulting graph will be 
strongly disconnected and organized in a 
modular manner. The modules are not di- 
rectly connected but rather influence the 
common leaves, which accounts for the 
flexibility [17]. Finally, the same modular 
structures relate flexibility to evolvability. 
Due to the separation among modules, 
changes (e.g., resulting from a muta- 
tion) in a certain module will affect only 
the dynamical properties of this module 
but not the remainder of the network. 
Consequently, the organism can explore
new niches without affecting its home- 
ostasis. The previous result appears to be 
in contrast with the assertion of Kauff- 
man, that networks must remain close to 
the transition to chaos in order to evolve, 
whilst this network is found deep in the ordered
regime. This apparent discrepancy
does not really exist, however, because in 
one model the external factors are directly 
involved, whereas the other results refer 
to autonomous systems. It can be said 
that the organization of the attractors is 
a function of the autonomous network, 
while the changes in the environmen- 
tal factors unfreeze the frozen genes and 
move the network to a new attractor. This 
mechanism represents an alternative to the
“edge of chaos” hypothesis proposed by
Kauffman. 


2.4 
Biologically Relevant Boolean Rules 


In real regulatory networks, not all of the 
Boolean rules have the same probability. 
Rather, it appears that there are biologi- 
cally relevant rules that form a small subset 
of all theoretically possible functions, and 
that these are related to the robustness of 
the network. It was first noted that these 
relevant rules are canalizing functions [6]; 
a canalizing function is a rule in which one
input alone (at its canalizing value)
can determine the output, while the other
inputs contribute to determining the output only if the
canalizing input is in its noncanalizing
state. A further investigation [20] showed
that the large majority of biologically
relevant rules are not simply canalizing, but
are hierarchically canalizing. In a
hierarchically canalizing function, all inputs are
essential in a hierarchical manner; that is,
the second canalizing input is canalizing
if the first one is in its noncanalizing state,
and so on.
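
Whether a given rule is canalizing can be checked by brute force over its truth table. The sketch below is a minimal illustration for rules with a small number of inputs; the function name `canalizing_inputs` and the three example rules are assumptions of this sketch.

```python
from itertools import product

def canalizing_inputs(rule, k):
    """List (input index, canalizing value, forced output) triples for a k-input rule."""
    found = []
    for i in range(k):
        for value in (0, 1):
            # Outputs over all input combinations in which input i is held at `value`.
            outputs = {rule(*x) for x in product((0, 1), repeat=k) if x[i] == value}
            if len(outputs) == 1:
                found.append((i, value, outputs.pop()))
    return found

def rule_and(a, b):
    return a & b

def rule_or(a, b):
    return a | b

def rule_xor(a, b):
    return a ^ b

for name, rule in (("AND", rule_and), ("OR", rule_or), ("XOR", rule_xor)):
    print(name, canalizing_inputs(rule, 2))
```

AND is canalizing because either input at 0 forces the output to 0, and OR because either input at 1 forces the output to 1; XOR has no canalizing input. Note that, under this test, a constant rule counts as trivially canalizing in every input.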


It has been shown that a Boolean 
network with particular subclasses of hi- 
erarchical canalizing functions demon- 
strates an ordered behavior; hence, these 
types of Boolean rules account for ro- 
bustness [21]. On analyzing a measured 
yeast transcriptional network, Kauffman 
and colleagues showed that, for the ensem- 
ble of generated models, those networks 
with canalizing functions were remark- 
ably stable, whereas those with arbitrary 
Boolean rules were only marginally stable 
[22]. This was an expected result, since 
real organisms live in a highly noisy en- 
vironment, but must still preserve their 
state. Besides, canalizing functions are
much more realistically realizable than random
functions. A counterexample is the
“exclusive or” rule, where the output is 1 if
exactly one of the inputs is 1, but is 0 if both
inputs are 1 or both are 0. This is evidently
unrealistic for a real regulatory interaction.
Boolean networks with forcing
(canalizing) rules will show an ordered behavior;
this also occurs because forcing rules
increase the probability of the formation of
forcing structures, that is, sub-circuits in which
a canalized state will propagate to linked
elements regardless of the initial condition [6,
22]. Thus, forcing structures favor stability
and perhaps also the emergence of fixed
points.

The aforementioned properties suggest 
that it would be best to adopt not a com- 
pletely random distribution in modeling 
a real network, but rather a distribu- 
tion of hierarchical canalizing functions. 
However, some groups have provided an 
alternative meaning to the notion of bi- 
ologically meaningful Boolean rules [23], 
which takes into account that not all bi- 
ologically relevant functions are forcing 
and that not all of the canalizing functions 
are biologically realizable. The meaningful 
functions are selected taking into account
the inhibitory or activatory roles of each
input. Clearly, the majority of these func- 
tions are canalizing, while the dynamical 
effects are the same as if just canalizing 
functions were used. Finally, it is important
to note that the sum rule is used in
many dynamical simulations:


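$$
\begin{aligned}
\sum_j J_{ij} S_j(t) + h > 0 &\;\Rightarrow\; S_i(t+1) = 1 \\
\sum_j J_{ij} S_j(t) + h \le 0 &\;\Rightarrow\; S_i(t+1) = 0
\end{aligned}
\qquad (1)
$$

where $J_{ij}$ is “+1” if the link represents
a positive regulation and “-1” for inhibitory
regulations, and $h$ is a threshold. This rule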
accounts only for the inhibitory/activatory 
nature of the interactions. 
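
A minimal sketch of such a threshold update is given below. It assumes that node i switches on when the signed sum of its active inputs plus the threshold term h is positive, and off otherwise; the two-node wiring, the value of h, and the function name are hypothetical.

```python
from typing import Sequence, Tuple

def sum_rule_step(state: Sequence[int], J: Sequence[Sequence[int]], h: float) -> Tuple[int, ...]:
    """Synchronous threshold update; J[i][j] is the signed link from node j to node i."""
    new_state = []
    for i in range(len(state)):
        total = sum(J[i][j] * state[j] for j in range(len(state))) + h
        new_state.append(1 if total > 0 else 0)
    return tuple(new_state)

# Hypothetical two-node circuit: node 0 activates node 1, node 1 inhibits node 0.
J = [
    [0, -1],  # inputs to node 0: inhibited by node 1
    [+1, 0],  # inputs to node 1: activated by node 0
]

state = (1, 0)
for t in range(4):
    print(t, state)
    state = sum_rule_step(state, J, h=0)
```

Only the signs of the links enter the rule, so no kinetic parameters are needed; the price is that all activators, and all inhibitors, are treated as equally strong.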


2.5
Dynamical Simulation: An Example 


The yeast cell cycle is one of the best known 
networks, and provides a good test for a 
Boolean model. Such a model has been 
proposed by Li et al. [24], who constructed 
a network of key known regulators. The 
network had 11 nodes representing pro- 
tein states, plus an external signal and two 
types of link: positive and negative, where 
the latter represented inhibition, repres- 
sion, or degradation. The protein states 
were updated at each time step according 
to the sum rule: 


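$$
\begin{aligned}
\sum_j J_{ij} S_j(t) > 0 &\;\Rightarrow\; S_i(t+1) = 1 \\
\sum_j J_{ij} S_j(t) < 0 &\;\Rightarrow\; S_i(t+1) = 0 \\
\sum_j J_{ij} S_j(t) = 0 &\;\Rightarrow\; S_i(t+1) = S_i(t)
\end{aligned}
\qquad (2)
$$

where $J_{ij} = J_p$ for positive links, and
$J_{ij} = J_n$ for negative links. The authors
also added self-links that represented the
degradation of nodes that were not
negatively regulated by other nodes. This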
very simple model was based only on
the wiring diagram inferred from qual- 
itative experimental knowledge, and on 
the positive/negative nature of the inter- 
action. The predictive power of this type 
of model does not lie in the accurate pre- 
diction of the expression dynamics, but 
rather on the attractors’ picture — that is, 
the description of the network’s states and 
their relations through dynamical transi- 
tions. 
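
The update rule used for this model differs from a plain threshold rule in that a node keeps its previous value when its inputs cancel exactly. The sketch below applies this three-case rule to a hypothetical two-node circuit, not to the actual cell-cycle network; the link weights of +1 and -1 are assumed values.

```python
from typing import Sequence, Tuple

J_POS, J_NEG = 1, -1  # assumed weights for positive and negative links

def three_case_step(state: Sequence[int], J: Sequence[Sequence[int]]) -> Tuple[int, ...]:
    """Synchronous update: on if the input sum is positive, off if negative, unchanged if zero."""
    new_state = []
    for i, current in enumerate(state):
        total = sum(J[i][j] * s for j, s in enumerate(state))
        if total > 0:
            new_state.append(1)
        elif total < 0:
            new_state.append(0)
        else:
            new_state.append(current)  # inputs cancel: the node keeps its state
    return tuple(new_state)

# Hypothetical circuit: A activates B, B inhibits A, and B carries a
# self-degradation link because no other node regulates it negatively.
J = [
    [0, J_NEG],      # inputs to A
    [J_POS, J_NEG],  # inputs to B (the second entry is the self-link)
]

state = (1, 0)  # A switched on by a transient signal
trajectory = [state]
for _ in range(4):
    state = three_case_step(state, J)
    trajectory.append(state)
print(trajectory)
```

In this toy circuit, the excitation of A propagates to B, which then shuts A off and decays, so that the network returns to the resting state (0, 0) and stays there.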

An analysis of the dynamics of the 
yeast cell-cycle network showed that all 
of the initial states flowed into one of 
seven fixed points. Strikingly, one of 
these fixed points attracted 86% of the 
initial configurations, and corresponded
to the biological stationary state
G1, in which cell growth occurs. Furthermore,
exciting the G1 state by turning on the
signal node induced the network to follow
a path in the phase space, which ultimately
returned to G1. This path was also
related to a biological path, and was
a very stable trajectory, in the
sense that any nonbiological state would
converge to the biological path.


2.6 

Boolean Networks Inference from 
Experimental Data: Probabilistic Boolean 
Networks 


Microarrays continue to generate huge amounts of expression data. The
interest in inferring networks from real experimental data derives
above all from a need to understand genetic regulation in specific
organisms, possibly with a view to developing therapeutic interventions
in diseases such as cancer or bacterial infections. However, because of
the great uncertainty of this type of data, and of a number of
underlying latent factors, reverse engineering is a very difficult
task.

During the past few years a large number of analytical and
computational tools have been developed to infer Boolean networks from
experimental data, with the choice among such tools depending on the
type of data at hand and the goal of the model to be constructed
[25, 26]. As noted above, within a Boolean network each gene state is
determined by some other genes, by means of a Boolean function. Hence,
the major task is to design Boolean functions (predictors, in
estimation theory) from the data. The general concept here is that,
given a target gene $Y$ and a set of input genes $X_1, X_2, \ldots,
X_n$, the optimal predictor of the target $Y$ based on the predictor
variables $X_1, \ldots, X_n$ is that with the minimal error according
to some probabilistic error measure. In practice, however, the
theoretically optimal predictor of $Y$ is unknown and must be estimated
from the observed expression levels of $Y$ and of the genes $X_1, X_2,
\ldots, X_n$ (for a detailed analysis, see Ref. [27]).

In practical cases, due to errors and external factors, it is often
necessary to settle for Boolean rules that minimize the number of
misclassifications. In this case, an important role is played by the
constraints that can be imposed, including prior knowledge of the
biologically relevant functions, such as the canalizing functions [26].
The main problem with this type of inference is the inherent
determinism of Boolean rules when compared to the uncertainty shown by
the data, which arises from the stochastic nature of gene expression,
experimental noise, and possible interacting latent variables. As a
result, a Boolean function designed on the sample data may be unable to
make predictions when confronted with different conditions.
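
As a hedged illustration of selecting Boolean rules that minimize the
number of misclassifications, the following Python sketch performs a
brute-force search over small input sets. For each candidate input set,
the best truth table is obtained by a majority vote among the samples
that share the same input pattern. The data layout (one row of 0/1
values per sample) and the function names `best_rule` and
`infer_predictor` are assumptions made for this example, and no
biological constraint such as canalization is imposed.

```python
from collections import Counter
from itertools import combinations, product

def best_rule(X, y, inputs):
    """Truth table over `inputs` with the fewest misclassifications of y."""
    votes = {}
    for row, target in zip(X, y):
        key = tuple(row[j] for j in inputs)
        votes.setdefault(key, Counter())[target] += 1
    # For each observed input pattern, predict the majority target value.
    table = {key: c.most_common(1)[0][0] for key, c in votes.items()}
    errors = sum(sum(c.values()) - max(c.values()) for c in votes.values())
    return table, errors

def infer_predictor(X, y, max_inputs=2):
    """Search all input sets of size <= max_inputs.

    Returns (errors, inputs, table) for the best set; ties are resolved
    in favour of the set examined first (smaller sets come first).
    """
    n_genes = len(X[0])
    best = None
    for k in range(1, max_inputs + 1):
        for inputs in combinations(range(n_genes), k):
            table, errors = best_rule(X, y, inputs)
            if best is None or errors < best[0]:
                best = (errors, inputs, table)
    return best

if __name__ == "__main__":
    # Synthetic data: the target is gene0 AND gene1; gene2 is irrelevant.
    X = [list(r) for r in product((0, 1), repeat=3)]
    y = [r[0] & r[1] for r in X]
    print(infer_predictor(X, y))
```

Restricting `max_inputs` to a small value is one simple way of limiting
overfitting to noisy samples, at the cost of missing rules with many
inputs.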

One natural approach to remedy this is to include uncertainty in the
model. Specifically, a number of simple functions (those that have just
a few inputs) are inferred, such that each of them has a chance to
contribute. This approach led to the formulation of Probabilistic
Boolean Networks (PBNs) [24] whereby, starting from a given state, the
network has a given probability of jumping to each of several other
states, determined by the Boolean rules and by the probabilities with
which they contribute to the flow between different states. The main
advantage of this approach lies in the fact that these networks are
closely related to the framework of Markov chains, so that a whole body
of rigorous results can be used.
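
The relation between PBNs and Markov chains can be made concrete with a
small sketch. The two-gene network, its candidate predictors and their
selection probabilities are all hypothetical, and the assumption that a
predictor is chosen independently for each gene at every step is made
here for simplicity. The code builds the state-transition matrix and
approximates the stationary distribution by power iteration.

```python
from itertools import product

# Hypothetical 2-gene PBN. Each gene has candidate predictors, given as
# (function of the full state, selection probability) pairs.
PREDICTORS = [
    [(lambda s: s[1], 0.7), (lambda s: 1 - s[1], 0.3)],            # gene 0
    [(lambda s: s[0] & s[1], 0.6), (lambda s: s[0] | s[1], 0.4)],  # gene 1
]

def transition_matrix(predictors):
    """Return (states, P) with P[s][t] the probability of jumping s -> t."""
    n = len(predictors)
    states = list(product((0, 1), repeat=n))
    P = {s: {t: 0.0 for t in states} for s in states}
    for s in states:
        # One predictor is drawn per gene, independently (an assumption).
        for choice in product(*predictors):
            prob = 1.0
            nxt = []
            for func, p in choice:
                prob *= p
                nxt.append(func(s))
            P[s][tuple(nxt)] += prob
    return states, P

def stationary(states, P, iters=1000):
    """Approximate the stationary distribution by power iteration."""
    pi = {s: 1.0 / len(states) for s in states}
    for _ in range(iters):
        pi = {t: sum(pi[s] * P[s][t] for s in states) for t in states}
    return pi

if __name__ == "__main__":
    states, P = transition_matrix(PREDICTORS)
    print(stationary(states, P))
```

Once the transition matrix is available, standard Markov-chain results
(stationary distributions, hitting times, and so on) apply directly.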


2.7 
Addition of Noise 


In order to mimic random perturbations in the framework of Boolean
networks, a node is typically flipped arbitrarily to its opposite
state. However, as this practice is quite unrealistic, an alternative
approach that consists of extending Boolean models to be time
continuous and stochastic [7] has recently been proposed. One way to
achieve this is to add an inner concentration dynamics inside each
node, keeping the discrete (1 or 0) formulation for its relation with
other nodes. The input signal of each node drives the growth or decay
of a concentration variable $c_i(t)$ assigned to that node. In
addition, an explicit time delay $t_d$, which mimics the transition
time in a real extended system, can be added. Stochastic biochemical
noise can additionally be incorporated by allowing the delay to
fluctuate. In this way, by using the sum rule for the input signals, it
is possible to write the following differential equation for the
dynamics:


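$$
\begin{aligned}
\sum_j J_{ij} S_j(t - t_d) + h_i \geq 0 \;&\Rightarrow\; \frac{dc_i(t)}{dt} = 1 - c_i(t) \\
\sum_j J_{ij} S_j(t - t_d) + h_i < 0 \;&\Rightarrow\; \frac{dc_i(t)}{dt} = 0 - c_i(t)
\end{aligned}
\tag{3}
$$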


It should be noted that, by introducing 
a threshold rule, the binary output is 
recovered as: 


$$
\begin{aligned}
c_i(t) > T \;&\Rightarrow\; S_i(t) = 1 \\
c_i(t) < T \;&\Rightarrow\; S_i(t) = 0
\end{aligned}
\tag{4}
$$


As noted above, noise can be added through the delay, $t_d \to t_d +
\chi_{ij}$, where $\chi_{ij}$ is a random variable assigned to each
link $J_{ij}$. In this case, the nodes are no longer synchronously
updated and, as a consequence, most of the attractors disappear, which
reveals that they were artifacts of synchronous updating. Finally, it
should be noted that the noisy version of the yeast model presented
previously is robust [28].
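
A minimal sketch of the concentration dynamics of Eqs (3) and (4) for a
single node is given below. It assumes a unit relaxation time, a fixed
(noise-free) delay, and that a non-negative delayed input drives the
concentration toward 1 while a negative one lets it decay toward 0. The
function name `simulate_node`, the callable `signal` (standing for the
summed input, threshold term included) and all parameter values are
illustrative assumptions.

```python
def simulate_node(signal, t_end, dt=0.01, delay=0.5, threshold=0.5, c0=0.0):
    """Euler integration of dc/dt = f(t - delay) - c for one node.

    `signal(t)` returns the summed input seen by the node at time t
    (it must also be defined for t < 0). f is 1 when the delayed input
    is non-negative and 0 otherwise. Returns times, c(t) and the binary
    output S(t) recovered with the threshold rule.
    """
    times, conc, out = [], [], []
    c = c0
    for n in range(int(round(t_end / dt)) + 1):
        t = n * dt
        times.append(t)
        conc.append(c)
        out.append(1 if c > threshold else 0)
        drive = 1.0 if signal(t - delay) >= 0 else 0.0
        c += dt * (drive - c)
    return times, conc, out

if __name__ == "__main__":
    # Hypothetical input: activating between t = 0 and t = 5, inhibiting otherwise.
    pulse = lambda t: 1.0 if 0.0 <= t < 5.0 else -1.0
    times, conc, out = simulate_node(pulse, t_end=10.0)
    print("first switch-on time:", times[out.index(1)])
```

With these settings the binary output switches on roughly `delay + ln 2`
time units after the input becomes activating, which shows how the
delay and the concentration build-up together set the timing of each
transition. A fluctuating delay could be mimicked by drawing `delay`
at random for each transition.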


3 
Continuous Dynamics Models 


The potential of Boolean approaches in the context of genetic
regulatory dynamics modeling lies in the versatility that the strong
assumptions beneath the method confer on it. These assumptions come at
a price, however: in the Boolean formalism a gene can be considered
only in one of two states, ON or OFF, and any intermediate,
quantitative possibility is necessarily neglected. In cases in which
this hypothesis cannot be considered reasonable, certain important
dynamical aspects of the systems are not captured by Boolean methods.
Consequently, there is a need for more general modeling tools that can
successfully describe a wider range of dynamical behaviors.
Historically, the main tool used to describe the chemical kinetics
underlying genetic regulatory processes is the ordinary differential
equation (ODE), on which attention will be focused in Sect. 3.1.

This type of formalism considers the 
concentrations of DNAs, RNAs, and pro- 
teins that are involved in a genetic circuitry 
as dynamical variables, the temporal and 
continuous evolutions of which can be de- 
fined by a rate equation that is essentially 
an ODE: 


$$\dot{p}_i = f_i(\mathbf{p}) \tag{5}$$


In this type of equation, the variable $p_i$ represents the
concentration of the $i$-th product involved in the regulatory network:
it is either a protein, an RNA, or any other substance. The components
of the vector $\mathbf{p}$ represent the whole set of concentrations of
products on which the temporal evolution of $p_i$ depends. The rate
function $f_i$ is, in general, highly nonlinear.

Once the different mathematical possibilities that can be adopted to
model genetic regulatory interactions with ODEs (essentially, different
choices of the rate function $f$) have been characterized, the next
stage is to analyze the dynamics of network motifs. Motifs are small
structures composed of a reduced number of genes or proteins (typically
between two and four) that regulate each other. Although the global,
genome-wide topological patterns of regulatory networks are highly
complex [29, 30], at the small scale of these systems the network
motifs appear as modular devices, the dynamics of which can be
characterized (at least as a general exercise) as if they were isolated
systems [31-33]. Complementary results derived from dynamic modeling,
experimental procedures, and bioinformatics techniques during the past
years have supported this proposal. In this sense, a series of
investigations has modeled the function of several network motifs,
successfully comparing experimental results with numerical data derived
from dynamical modeling [34-40]. These findings sustain the thesis that
(at least in the cases studied) the dynamics of these small genetic
circuits can be associated with simple information-processing tasks
such as noise reduction, sequential programming, logic operations,
bistable switching, and pulse generation. From this point of view,
network motifs may constitute the "functional bricks" of biological
regulatory networks, in a similar way to transistors, diodes, and
amplifiers in electronic circuits [41]. In addition, several studies
have confirmed that the number of network motifs of each type present
in biological networks is anything but random [42]. To date, all of the
regulatory networks studied have been found to share similar motifs
when compared to adequately defined null models. These results suggest
that the concept of motifs as elementary information-processing devices
may have Darwinian consequences, in the form of an evolutionary
pressure that is responsible for the different levels of significance
observed for different motifs. These types of question, some of which
remain unanswered, are discussed in Sect. 3.2.


3.1 
ODE Formalisms: From Biochemistry to 
Mathematics 


In this section, the main families of ODE-based formalisms used to
model genetic regulatory systems are reviewed. At this point, it may be
beneficial to emphasize the conceptual difference that exists between
the two typical families of models reported in the literature. First
(see Sect. 3.1.1), the models grouped as "biochemical background-based"
are discussed, the main characteristic of which is their dependence on
the precise biochemical mechanisms that drive the biomolecular
processes. These mechanisms are translated into specific hypotheses
that are contained in the equations. In turn, a rich variety of
substantially different phenomenologies translates into very different
dynamical behaviors. As will be discussed, the main problem with this
type of model is that the details of the underlying biochemistry are
not always known, and consequently it is difficult to perform useful
simulations. More precisely, not knowing the rate constants that appear
in the equations creates a bottleneck when comparing experimental data
with numerical results derived from model predictions. It is also
important to distinguish between the dynamical implications of the
different types of regulatory interaction, namely transcriptional
regulatory (TR) interactions and protein-protein (PP) interactions
(such as phosphorylation and dephosphorylation). In Sect. 3.1.2, some
alternative approaches are reviewed that, from a more empirical point
of view, aim to reproduce the qualitative dynamics of experimentally
characterized regulatory systems without incorporating such detailed
biochemical knowledge.


3.1.1 Biochemical Background-Based 
Models 

As noted above, the family of models described in this section treats
the concentrations of the substances involved as dynamical variables
that evolve according to the regulations that each substance receives.
The temporal evolution of these concentrations follows rate ODEs with
the general form of Eq. (5). Within this scheme, these models share the
same spirit, namely, the idea of selecting the precise form of the rate
functions $f$ according to the precise biochemical mechanisms that
drive the regulatory interactions. Having said that, it is also worth
highlighting the differences between the two types of regulatory
interaction that are best characterized in the literature, namely TR
and PP interactions. As will be seen, the dynamics associated with
these two groups of regulatory interactions are not equivalent. More
importantly, these differences are not always fully taken into account
[31, 32].


Dynamic Modeling of Transcriptional Regulatory Interactions. In the
simplest picture, a TF is a protein that is capable of recognizing
precise sequences of DNA (i.e., promoters or, more precisely, target
DNA regions within promoters) that are very close to the points where
the transcription of regulated genes begins. By binding the DNA at
these specific regions, the TF essentially modifies the chemical
affinity of RNA polymerase (RNAp) for the transcription start sites of
the genes under regulation. Leaving aside the details of these
catalytic mechanisms, there are two important points from which to
start the analysis:

• The TF regulates the production of the substrate protein S encoded
  by the gene s.

• In order to achieve this, the TF does not interact directly with
  protein S; hence, kinetically, the regulation dynamics does not
  depend on the concentration of the regulated substrate S.


The simplest form to describe the 
influence of the TF, T, on protein S, is 
then: 


$$\dot{S} = k_1 f(T) - k_2 S \tag{6}$$


In Eq. (6) the first term refers to the regulatory effect of the TF $T$
on the variable $S$ which, as noted previously, does not depend on $S$.
In order to represent an activation (inhibition), $f(T)$ must be a
monotonically increasing (decreasing) function of $T$. In turn, the
second term corresponds to the usual degradation process of the protein
[43, 44]. The rate constants $k_1$ and $k_2$ are always positive.

One candidate frequently found in the literature [45] is the so-called
Hill function, which takes the form:


$$H^{+}(T, \theta, H) = \frac{T^H}{T^H + \theta^H} \quad \text{(Activation)} \tag{7a}$$

$$H^{-}(T, \theta, H) = \frac{\theta^H}{T^H + \theta^H} \quad \text{(Inhibition)} \tag{7b}$$


In their physical domain ($T \geq 0$), both functions are non-negative
and bounded: $0 \leq H^{+} \leq 1$ and $0 \leq H^{-} \leq 1$. In
addition, the activation (inhibition) function is monotonically
increasing (decreasing) with the concentration of the TF $T$. The
sigmoidal shape of these curves has the mathematical properties desired
to describe the regulatory effects in each case, has been known for
more than 30 years to be in good agreement with experimental results
[46, 47], and also avoids the difficult task of introducing the
TF-DNA-RNAp interaction directly into the model. In Eqs (7a) and (7b),
the parameter $\theta$ is the TF concentration at which the function
reaches half of its maximal value, while the exponent $H$ takes into
account any cooperative effects by defining the steepness of the
sigmoid. Thus, the temporal evolution of $S$, as a function of the
concentration $T$ of a certain activator (inhibitor), can be expressed
as:


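$$\dot{S} = k_1 \frac{T^H}{T^H + \theta^H} - k_2 S \quad \text{(Activation)} \tag{8a}$$

$$\dot{S} = k_1 \frac{\theta^H}{T^H + \theta^H} - k_2 S \quad \text{(Inhibition)} \tag{8b}$$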


Moreover, any delays due to transcription and translation processes can
be taken into account by simply delaying the argument of the Hill
functions. The importance of such delay effects has been characterized
in detail [48, 49], although their relevance was found to be only
relative [50].
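
Eqs (8a) and (8b) are simple enough to integrate directly. The sketch
below is a minimal illustration, assuming a constant regulator
concentration $T$ and a forward-Euler scheme; the function names and
parameter values are hypothetical. At long times the substrate relaxes
to the steady state $S^* = (k_1/k_2)\,H^{\pm}(T, \theta, H)$, which
follows from setting the left-hand side of Eq. (8a) or (8b) to zero.

```python
def hill_act(T, theta, H):
    """Activating Hill function, Eq. (7a)."""
    return T**H / (T**H + theta**H)

def hill_inh(T, theta, H):
    """Inhibiting Hill function, Eq. (7b)."""
    return theta**H / (T**H + theta**H)

def integrate_S(T, k1, k2, theta, H, activation=True,
                S0=0.0, dt=0.001, t_end=20.0):
    """Forward-Euler integration of Eq. (8a) or (8b) at constant T."""
    f = hill_act if activation else hill_inh
    S = S0
    for _ in range(int(round(t_end / dt))):
        S += dt * (k1 * f(T, theta, H) - k2 * S)
    return S

if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    print(integrate_S(T=1.0, k1=2.0, k2=1.0, theta=1.0, H=2))
    print(integrate_S(T=0.0, k1=2.0, k2=1.0, theta=1.0, H=2, activation=False))
```

A delay could be introduced, as described above, by evaluating the Hill
function at the regulator concentration of an earlier time rather than
the current one.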


Dynamic Modeling of Protein-Protein Regulatory Interactions. With
regard to protein functionality, many proteins (in both prokaryotic and
eukaryotic cells) frequently undergo post-translational modifications
mediated by other enzymes, such that their configurations are modified
and their functionalities altered. In this sense, the enzymes in charge
of catalyzing these modifications act, as a matter of fact, as protein
regulators. Phosphorylation-dephosphorylation reactions constitute the
paradigmatic example of this type of PP interaction. The regulatory
role of these interactions has been characterized, occasionally, in
great detail. For example, in budding yeast, the cyclin-dependent
kinases (cdKs) are enzymes which, by phosphorylating a series of
substrates (usually referred to as executory proteins; EPs), either
activate or inhibit them. As shown previously, the periodic profiles of
activation of these EPs, ultimately regulated by cdKs, are closely
related to the correct development of the cell cycle [51].

The paradigmatic description of this regulatory mechanism (by virtue of
which a certain enzyme regulates the activity of a certain substrate)
is a chemical sequence of two processes. First, the enzyme and the
substrate must meet at a rate $k_1$ to form a transient complex, the
concentration of which will be denoted as $C$. This complex must then
be broken to release the intact enzyme and the modified substrate, at a
rate $k_3$. The concentrations of the enzyme, inactive substrate and
active substrate are denoted, respectively, as $E$, $S$, and $S_a$. In
addition, a defective break of the complex can occur that fails to
cause the desired modification on $S$; this is produced at rate $k_2$,
such that:


$$
\begin{aligned}
S + E &\xrightarrow{k_1} C \xrightarrow{k_3} S_a + E \\
C &\xrightarrow{k_2} S + E
\end{aligned}
\tag{9}
$$


With regard to these chemical reactions, it is important to note that
the regulatory role of the enzyme $E$ (unlike the case discussed above,
of the TF $T$ acting on a certain DNA binding site) has two main
features: (i) it does not modify the total concentration of substrate
($S_T = S + S_a + C$, which remains conserved during the process); and
(ii) it depends heavily on the concentration of the substrate. This
scheme is none other than the Michaelis-Menten (MM) model, the
corresponding rate equations of which are as follows:


$$\dot{S} = -k_1 S E + k_2 C \tag{10a}$$

$$\dot{C} = k_1 S E - (k_2 + k_3)\, C \tag{10b}$$

$$\dot{E} = -k_1 S E + (k_2 + k_3)\, C \tag{10c}$$

$$\dot{S}_a = k_3 C \tag{10d}$$


with the constraints $E + C = E_T$ (as the enzyme can be found either
free or combined with the substrate) and $S + S_a + C = S_T$, as the
substrate can, in turn, be found in its free form (either active or
inactive) or combined with the enzyme to form the transient complex,
$C$. These relationships allow Eqs (10a) and (10b) to be rewritten as
an independent system, by substituting into them the following
expressions:


$$E = E_T - C \tag{11a}$$

$$S = S_T - S_a - C \tag{11b}$$

which yields:

$$\dot{S} = -k_1 S (E_T - C) + k_2 C \tag{12a}$$

$$\dot{C} = k_1 S (E_T - C) - (k_2 + k_3)\, C \tag{12b}$$


The temporal evolution of the other two variables obeys the expressions
$\dot{E} = -\dot{C}$ and $\dot{S}_a = -\dot{C} - \dot{S}$. The standard
approach to MM dynamics [52] is to assume that the complex achieves its
dynamical equilibrium very quickly, after which the temporal variations
of the complex concentration can be neglected: $\dot{C} \approx 0$. If
it is assumed that this situation holds, then the stationary
concentration of complex $C^*$ can be easily obtained by setting the
left-hand side of Eq. (12b) to zero, yielding:


$$C^* = \frac{E_T S^*}{S^* + K_M} \tag{13}$$


where $K_M = (k_2 + k_3)/k_1$ is the so-called MM constant. Once this
quasi-stationary state is reached for the complex, the temporal
evolution of the fractions of substrates is obtained as:


$$\dot{S} = -\dot{S}_a = -\frac{k_3 E_T S^*}{S^* + K_M} \tag{14}$$


which does not vanish. That is, even within this quasi-stationary
approximation, both the quasi-stationary concentration $S^*$ and,
ultimately, $C^*$ still depend on time, even if the temporal derivative
of $C$ is small enough to be neglected in Eq. (12b).
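
For completeness, the algebra that leads from Eq. (12b) to Eqs (13) and
(14) can be written out explicitly. This is only a restatement of the
quasi-stationary argument above, using the symbols already defined and
the assumption $\dot{C} \approx 0$:

$$
\begin{aligned}
0 &= k_1 S^* (E_T - C^*) - (k_2 + k_3)\, C^* \\
\Rightarrow\; C^* &= \frac{k_1 E_T S^*}{k_1 S^* + k_2 + k_3}
 = \frac{E_T S^*}{S^* + K_M},
 \qquad K_M = \frac{k_2 + k_3}{k_1}, \\
\dot{S} &= -k_1 S^* (E_T - C^*) + k_2 C^*
 = -(k_2 + k_3)\, C^* + k_2 C^* = -k_3 C^*, \\
\dot{S}_a &= k_3 C^* = -\dot{S}.
\end{aligned}
$$

The third line uses the first one to replace $k_1 S^* (E_T - C^*)$ by
$(k_2 + k_3)\, C^*$ in Eq. (12a), and the last line is Eq. (10d).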

There is, however, an important problem related to the MM approach, as
explained here, when reducing the cumbersome dynamics described in the
system of Eqs (10a)-(10d) to the simpler dynamics of Eq. (14). If it is
considered that the complex formation stage is very fast, then it is
possible to set $\dot{C} \approx 0$ for almost all times. However, this
implicitly assumes that two time scales are involved in the problem:
(i) an initial fast phase that persists until the quasi-stationary
concentration of complex is reached; and (ii) a second stage in which
the evolution of the substrate concentration can be approximated by Eq.
(14), while the complex concentration behaves approximately according
to Eq. (13). In this sense, to integrate Eq. (14), which is only valid
in the second phase, the initial condition for $S$ would be required at
the start of the second phase.

The usual approach to this problem involves taking the total amount of
substrate $S_T$ as the initial condition; that is, considering that the
concentration of substrate $S$ is not substantially modified during the
transient phase. As the variation of $S$ during this phase is due to
the substrate-enzyme reaction that forms the complex until
quasi-stationarity is reached, it is assumed that, during complex
formation, the "limiting reagent" is the enzyme which, in mathematical
terms, is none other than $E_T \ll S_T$. This approximation (which is
sufficient in many experimental cases) must be reconsidered if the aim
is to simulate complex regulatory networks, as it may not be a
reasonable hypothesis for all PP interactions in the network. In those
cases where the concentrations of the enzyme (regulator) and substrate
(target) are similar, the option of integrating Eq. (14) using $S_T$ as
the initial condition may not hold. A more detailed explanation of how
to deal with this problem (when present) is presented in the classical
text of Murray [52].
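
The dependence of the quasi-stationary reduction on the ratio between
enzyme and substrate can be checked numerically. The sketch below
integrates the full system of Eqs (12a) and (12b) and the reduced Eq.
(14), the latter started from $S_T$, and compares the substrate
concentration at a chosen time. The fourth-order Runge-Kutta helper,
the function names and all parameter values are assumptions made for
this illustration.

```python
def rk4(f, y, t_end, dt=1e-3):
    """Classical fourth-order Runge-Kutta for dy/dt = f(y); y is a list."""
    for _ in range(int(round(t_end / dt))):
        s1 = f(y)
        s2 = f([a + 0.5 * dt * b for a, b in zip(y, s1)])
        s3 = f([a + 0.5 * dt * b for a, b in zip(y, s2)])
        s4 = f([a + dt * b for a, b in zip(y, s3)])
        y = [a + dt * (p + 2 * q + 2 * r + s) / 6
             for a, p, q, r, s in zip(y, s1, s2, s3, s4)]
    return y

def full_mm(S_T, E_T, k1, k2, k3, t_end):
    """S(t_end) from the full system, starting with no complex formed."""
    def rhs(y):
        S, C = y
        dS = -k1 * S * (E_T - C) + k2 * C          # Eq. (12a)
        dC = k1 * S * (E_T - C) - (k2 + k3) * C    # Eq. (12b)
        return [dS, dC]
    return rk4(rhs, [S_T, 0.0], t_end)[0]

def qssa_mm(S_T, E_T, k1, k2, k3, t_end):
    """S(t_end) from the reduced Eq. (14), using S_T as initial condition."""
    K_M = (k2 + k3) / k1
    return rk4(lambda y: [-k3 * E_T * y[0] / (y[0] + K_M)], [S_T], t_end)[0]

def relative_gap(S_T, E_T, t_end, k1=1.0, k2=1.0, k3=1.0):
    """Difference between the two descriptions, relative to S_T."""
    full = full_mm(S_T, E_T, k1, k2, k3, t_end)
    reduced = qssa_mm(S_T, E_T, k1, k2, k3, t_end)
    return abs(full - reduced) / S_T

if __name__ == "__main__":
    print("E_T << S_T:", relative_gap(S_T=10.0, E_T=0.1, t_end=50.0))
    print("E_T ~  S_T:", relative_gap(S_T=10.0, E_T=10.0, t_end=0.5))
```

With these illustrative values the two descriptions agree closely when
the enzyme is scarce, and differ markedly when enzyme and substrate are
present in similar amounts, because a large fraction of the substrate
is then sequestered in the complex during the fast transient.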

The two types of regulation mechanism, transcriptional and enzymatic,
are clearly not the only possibilities. Perhaps the main example of
protein functionality modification that has not been directly mentioned
here is that due to ligand-protein binding. Certain proteins can bind
small ligands to form transient complexes, the functionality of which
is different from that of the free protein. Consequently, the ligand
can be seen as the regulator and the protein as the substrate, the
behavior of which is modified. The dynamics derived from this type of
process can also be described using an MM approach that is very similar
to, and even simpler than, that described here. One very relevant case
of proteins whose functionality can be regulated by the presence of
small signaling ligand molecules is that of the TFs themselves. The
chemical affinity of a large number of TFs for their DNA consensus
sequences depends on whether the TF is free, or is bound to a certain
number of ligand molecules. As noted above, the dynamics of these
phenomena is very similar to MM enzymatic dynamics, and is described in
the appendix of Alon's book [45].


3.1.2 Empirically Based Approaches 

Quite often, the approaches detailed in the previous section are less
useful than might be imagined, because in many cases it is difficult to
verify any of the hypotheses that underlie the models. Consequently,
some comments are in order. On the one hand, all of these rate
equations have been written on the hypothesis that the proteins
involved do not interact with any other substrate, which might
therefore be omitted from the scheme. However, this hypothesis is
generally false; for instance, a single TF such as PhoP in
Mycobacterium tuberculosis can regulate more than 100 genes [53, 54].
Hence, problems may emerge because, both for TR interactions and for PP
regulations, the regulators and substrates involved in regulatory
systems are under the influence of a complex pattern of interactions
with many more molecules. For example, cooperativity [55], competition
between regulators to bind the same target [56, 57], or the so-called
"zero-order ultrasensitivity" in PP interactions [58] will affect the
dynamics of the regulatory systems in a nontrivial manner. Some of the
considerations that can be made regarding these points, along with the
dynamics of network motifs, are discussed in Sect. 3.2.

On the other hand, the essential issue regarding the validity of these
models in each particular case lies in the difficulty of determining
the rate constants of the equations in advance. Whilst an alternative
(as in other areas) would be to adopt a reverse engineering approach to
determine the constants, this would mean being forced to adopt (ab
initio) a certain dynamical model that reflects one precise biochemical
mechanism, and not others. This situation, together with the fact that
multiple models can adequately describe the same experimental behaviors
simply by tuning the rate constants, forces an admission that it is not
sensible to seek greater sophistication of the models according to more
detailed and precise biochemical phenomenology, at least until the
experimental determination of the individual rate constants and other
biochemical parameters becomes possible.

One family of these alternative mod- 
els is that of the piecewise-linear dif- 
ferential equations (PLDEs). In these 
PLDEs, the strict biochemical details of 
the regulatory mechanism are ignored; 
rather, attention is focused on determining 
quasi-empirically the range of regulator 
concentrations that drives the expression 
of a certain target at each of the experi- 
mentally observed levels. Mathematically, 
PLDEs have the following form:

$$\dot{S}_i = f_i(\mathbf{S}) - k_i S_i \tag{15}$$

where the components $S_i$ of the vector $\mathbf{S}$ are the
concentrations of the substrates involved in the regulatory system (the
set of these substrates is called $\Sigma$). In turn, while $k_i$ is
the usual degradation rate of the $i$-th substrate [43, 44], the
function $f_i(\mathbf{S})$ is often defined as follows:


$$f_i(\mathbf{S}) = \sum_{j \in \Sigma} k_{ij}\, b_{ij}(\mathbf{S}) \tag{16}$$


where the functions $b_{ij}(\mathbf{S})$ can be defined as a
combination of sums and multiplications of step functions in which the
parameters $\theta$ define the thresholds:

$$
s^{+}(S_j, \theta_j) =
\begin{cases} 0, & S_j < \theta_j \\ 1, & S_j > \theta_j \end{cases}
\qquad
s^{-}(S_j, \theta_j) =
\begin{cases} 0, & S_j > \theta_j \\ 1, & S_j < \theta_j \end{cases}
\tag{17}
$$

Essentially, once the set of rate constants $k_{ij}$ has been defined,
the functions $b_{ij}(\mathbf{S})$ control the regions of the phase
space in which each rate constant (or, eventually, combinations of
them) drives the production rate of each substrate $S_i$ [59-61]. This
method is closely related to Boolean approaches in the sense that the
ultimate cause of variation in the substrates' production rates is
reduced to step-like, discontinuous functions. Nevertheless, the
temporal evolution of the substrate concentrations is defined
continuously by Eq. (15), this being an important difference between
PLDE methods and Boolean (discrete, in general) approaches. Another
relevant difference is that, whereas in discrete models it is necessary
to determine only an adequate updating rule in order to define the
possible transitions between system states, in the case of the PLDEs
the parameters (more precisely, the rate constants $k_{ij}$ and the
functions $b_{ij}(\mathbf{S})$) must be determined in order that the
model can be defined.
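
A minimal PLDE sketch in the spirit of Eqs (15)-(17) is shown below for
a hypothetical two-gene cascade: the first substrate is produced
constitutively, and the second is produced only while the first exceeds
a threshold. The cascade, the function names and the parameter values
are illustrative assumptions; the step function is also assigned the
value 0 exactly at the threshold, a case that the definition above
leaves open.

```python
def s_plus(x, theta):
    """Step function s+: 1 above the threshold, 0 otherwise."""
    return 1.0 if x > theta else 0.0

def s_minus(x, theta):
    """Step function s-: the complement of s+."""
    return 1.0 - s_plus(x, theta)

def plde_cascade(k1=1.0, k2=1.0, g1=1.0, g2=1.0, theta=0.5,
                 dt=0.001, t_end=20.0):
    """Forward-Euler integration of a two-gene PLDE cascade from (0, 0).

    dS1/dt = k1 - g1 * S1                      (constitutive production)
    dS2/dt = k2 * s_plus(S1, theta) - g2 * S2  (on only when S1 > theta)
    """
    S1, S2 = 0.0, 0.0
    for _ in range(int(round(t_end / dt))):
        dS1 = k1 - g1 * S1
        dS2 = k2 * s_plus(S1, theta) - g2 * S2
        S1, S2 = S1 + dt * dS1, S2 + dt * dS2
    return S1, S2

if __name__ == "__main__":
    print("early:", plde_cascade(t_end=0.5))
    print("late: ", plde_cascade(t_end=20.0))
```

Between threshold crossings each equation is linear, so the trajectory
is a sequence of exponential relaxations toward different target
levels; this is what makes the model piecewise linear while the
concentrations themselves remain continuous.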

There are, however, other possibilities. In a recent review, Tyson and
Novak [32] exploited the potential and versatility of another ODE-based
approach that is not attached to any precise biochemical mechanism, and
which was first proposed in Ref. [62]. According to this scheme, the
rate equation associated with the temporal evolution of the
concentration of the $i$-th substrate $S_i$ may be defined as follows:

$$\dot{S}_i = k_i \left[ F(\sigma_i W_i(\mathbf{S})) - S_i \right] \tag{18}$$

where the influence of each substrate on the production rate of $S_i$
is no longer carried by a certain combination of step functions, but by
the sigmoid:


$$F(\sigma_i W_i(\mathbf{S})) = \frac{1}{1 + e^{-\sigma_i W_i(\mathbf{S})}} \tag{19}$$

that varies from 0 (in the limit $W_i(\mathbf{S}) \ll -1/\sigma_i$) to
1 (in the limit $W_i(\mathbf{S}) \gg 1/\sigma_i$). The parameter
$\sigma_i$ defines the steepness of the function at its inflection
point, whereas the argument $W_i(\mathbf{S})$ encodes the biochemical
input to the rate equation as follows:

$$W_i(\mathbf{S}) = \omega_{i0} + \sum_j \omega_{ij} S_j \tag{20}$$


In the latter expression, the coefficients $\omega_{ij}$ measure the
influence of the $j$-th substrate on the $i$-th concentration. The
versatility of this type of approach has been demonstrated in Ref.
[32], where the main dynamical features of the most studied network
motifs were reviewed. All simulations in that study were performed
following this formalism, and the main dynamical behaviors observed in
network motifs, including noise reduction, logic processing of signals,
pulse generation, oscillations, and bistability, were reproduced.
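
As a hedged illustration of the formalism of Eqs (18)-(20), the sketch
below integrates a hypothetical two-node motif in which each node
inhibits the other. The weights, steepness values and rate constants
are arbitrary illustrative choices, not taken from Ref. [32], and a
simple forward-Euler scheme is assumed. With these choices the motif is
bistable: the node that starts higher ends up ON and keeps the other
OFF.

```python
import math

def F(x):
    """Sigmoid of Eq. (19), written in terms of x = sigma_i * W_i."""
    return 1.0 / (1.0 + math.exp(-x))

def simulate(S, omega0, omega, sigma, k, dt=0.01, t_end=50.0):
    """Forward-Euler integration of Eq. (18) for all substrates."""
    n = len(S)
    S = list(S)
    for _ in range(int(round(t_end / dt))):
        # Net input of Eq. (20) for every node.
        W = [omega0[i] + sum(omega[i][j] * S[j] for j in range(n))
             for i in range(n)]
        S = [S[i] + dt * k[i] * (F(sigma[i] * W[i]) - S[i]) for i in range(n)]
    return S

# Hypothetical mutual-inhibition motif (all values are assumptions).
OMEGA0 = [0.5, 0.5]
OMEGA = [[0.0, -1.0],
         [-1.0, 0.0]]
SIGMA = [10.0, 10.0]
K = [1.0, 1.0]

if __name__ == "__main__":
    print(simulate([0.9, 0.1], OMEGA0, OMEGA, SIGMA, K))
    print(simulate([0.1, 0.9], OMEGA0, OMEGA, SIGMA, K))
```

Changing only the signs and magnitudes in `OMEGA` is enough to encode a
different wiring, which is the practical appeal of this formalism.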

The two families of empirically inspired models presented here
constitute versatile tools in dynamical modeling that can be used to
reproduce motif dynamics that are not always obvious [32], and have
even been applied (after adequate refinements) to simulate complex
physiological cell events such as the onset of sporulation in Bacillus
subtilis [63]. However, in the spirit of simplifying the mathematical
treatment of the problem, the differences between TR and PP
interactions are not taken into account. The main consequence of this,
beyond any other question concerning the dynamical differences between
these two types of regulation, is that, in the typical forms presented
here, these models do not take into account the possibility that only a
fraction of a regulated substrate could be activated (inhibited).


3.2 
Summing Nodes and Links: From Math to 
Systems Biology 


Having outlined the main biochemical and mathematical issues, and
having discussed the most widespread theoretical approaches to the
modeling of single regulatory interactions, the problem, rather than
being solved, is in fact only beginning. The reason for this derives
from the general nonlinearity of the biochemical processes under study,
and the problem can be summarized as follows. Previously in this
chapter, the main mathematical tools used to perform dynamical
simulations of single regulatory interactions (one regulator, one
target) have been reviewed. Yet, in cell biochemistry the odds of
having a protein whose concentration depends on only one variable are
practically null. In attempts to understand how structures that involve
even small numbers of substrates and regulations (nodes and links) can
give rise to nontrivial dynamical behaviors, the most widely applied
technique over the past few years has been the exhaustive study of the
dynamics of network motifs. As noted above, a network motif is a
regulatory system composed of a small number of nodes (genes, proteins)
and links (regulations), typically between two and six. In the
following subsections, the dynamics of the four best-characterized
motifs, namely feedback (FB) loops, single input modules (SIMs), single
output modules (SOMs), and feedforward loops (FFLs), are reviewed.


3.2.1 Simple but Subtle Structures: SIMs 
and SOMs 

Perhaps the two simplest structures that 
can be imagine in the context of small reg- 
ulatory systems are SIMs and SOMs, the 


Fig. 1 (a) Single input modules (SIMs) 
are formed by a regulator responsible for 
the activity of many targets; (b) Single out- 
put modules (SOMs) consist of a single 
target, the expression of which depends 
on many regulators, either transcription 
factors (TFs) or regulating enzymes. The 
black color of all the regulations in the 


painless structures of which are shown in 
Fig 1. Despite their simplicity, these struc- 
tures reproduce certain subtle dynamical 
properties that could pass unnoticed, but 
set out certain modeling issues that force 
the adoption of new hypotheses, in ad- 
dition to the assumptions detailed in the 
preceding sections of the chapter. More 
specifically, in the case of SOM a problem 
arises that has not previously been encoun- 
tered; namely, how can several interactions 
ona single substrate be modeled? Not sur- 
prisingly, multiple answers exist for this 
question. 

Starting with TR interactions, a rich va- 
riety of behaviors can be found depending 
on the way in which the regulators in- 
teract with the target promoter [64, 65]. 
A first case would consist of two regula- 
tors that must interact with each other to 
effectively perform the final activation (in- 
hibition) of target transcription. These nec- 
essary interactions can occur either before 
DNA-binding (the two regulators would 
be a factor—cofactor couple in this case) or 
after the independent binding of the regu- 
lators. For instance, when the DNA target 
regions of each single TF are close — but 
do not overlap — the RNAp affinity to the 
site can be regulated only when both TFs 
have been bound. This type of biochemical
phenomenology drives an AND logic


figure indicates generic (transcriptional or
protein-protein) regulatory interactions;
in subsequent figures, blue lines indicate
protein-protein (PP) regulations, and red
lines transcriptional regulatory (TR)
interactions. The arrows represent activations,
and right-angles represent inhibitions.


gate, in the sense that the presence of
both regulators is required for the
regulation to occur. When both regulators
are activators, the response of this
AND-logic, three-node SOM can be modeled
as follows:


where it has been assumed, for simplicity,
that both regulators present a similar
level of cooperativity, accounted for by
a common Hill exponent, H. The final
term in the equation again describes
a standard degradation process for the
substrate.
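
A form of Eq. (21) consistent with this description is sketched below. The notation is assumed by analogy with Eq. (23): $T_1$ and $T_2$ are the regulator concentrations, $\theta_1$ and $\theta_2$ the half-maximum thresholds, $H$ the common Hill exponent, and the rate-constant labels $k_1$ (production) and $k_2$ (degradation) are hypothetical.

$$
\dot{S} = k_1 \, \frac{(T_1/\theta_1)^H}{1 + (T_1/\theta_1)^H} \, \frac{(T_2/\theta_2)^H}{1 + (T_2/\theta_2)^H} - k_2 S \qquad (21)
$$

The product of the two Hill terms vanishes whenever either regulator is absent, which is what makes the device an AND gate.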

There are, of course, many other
possibilities. For example, when considering
non-overlapping, independent DNA target
regions for the two regulators, it is found
that the activity of each regulator is
independent, such that the RNAp activity on
the promoter may be stimulated by either of
the two factors, also independently. In this
case, the device does not function as an
AND gate, but as an OR gate. For this type
of noncompetitive double activation (the
OR logic process) the Hill terms are not
multiplied but rather are summed, while



200 | Dynamics of Biomolecular Networks 


the response of the device can be modeled 
as: 
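
Writing $T_1$, $T_2$ for the regulator concentrations, $\theta_1$, $\theta_2$ for the half-maximum thresholds and $H$ for the common Hill exponent (notation assumed by analogy with Eq. (23)), a form consistent with this description sums the two Hill terms. The rate-constant labels $k_1$, $k_2$ and $k_3$ are hypothetical.

$$
\dot{S} = k_1 \, \frac{(T_1/\theta_1)^H}{1 + (T_1/\theta_1)^H} + k_2 \, \frac{(T_2/\theta_2)^H}{1 + (T_2/\theta_2)^H} - k_3 S \qquad (22)
$$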


Finally, if the DNA target regions overlap,
then a simultaneous binding of the two
factors is physically impossible. This situation
is referred to as “competitive binding,”
and although the device functions as a
logical OR gate (by force, in this case), the
mathematical modeling is slightly different.
Again for double activation:


$$
\dot{S} = \frac{k_1 (T_1/\theta_1)^H + k_2 (T_2/\theta_2)^H}{1 + (T_1/\theta_1)^H + (T_2/\theta_2)^H} - k_3 S \qquad (23)
$$


In the cases of both Eqs (22) and (23), it is
sufficient to replace the Hill term in the
numerator by unity to account for the
inhibitory activity of a regulator. It
should be mentioned here that this
mathematical approach offers, naturally,
an adequate description of dual regulation.
Dual TFs are regulators that act with
different signs, depending on the presence
of a second regulator in the cell. In this
sense, by using a variant of Eq. (21) with
two opposite multiplicands, this important
phenomenology can be easily described.
These types of modeling strategy, which
reproduce different logical combinations of
inputs on a single substrate, are used
in Ref. [66]. Nevertheless, the virtually
unbounded richness of biochemical
possibilities [35, 49, 67] multiplies the
number of possible models.
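
The three ways of combining two activators on one promoter (AND, noncompetitive OR, and competitive binding) can be compared numerically. The following is a minimal sketch, not the authors' code: the function names, the default parameter values, and the unit rate constants are illustrative assumptions.

```python
def hill(x, theta, H):
    """Activating Hill term (x/theta)^H / (1 + (x/theta)^H)."""
    r = (x / theta) ** H
    return r / (1.0 + r)


def and_gate(t1, t2, theta1=1.0, theta2=1.0, H=2.0, k=1.0):
    """AND logic: the two Hill terms are multiplied."""
    return k * hill(t1, theta1, H) * hill(t2, theta2, H)


def or_gate(t1, t2, theta1=1.0, theta2=1.0, H=2.0, k1=1.0, k2=1.0):
    """Noncompetitive OR logic: the two Hill terms are summed."""
    return k1 * hill(t1, theta1, H) + k2 * hill(t2, theta2, H)


def competitive_gate(t1, t2, theta1=1.0, theta2=1.0, H=2.0, k1=1.0, k2=1.0):
    """Competitive binding: both regulators share one denominator."""
    r1 = (t1 / theta1) ** H
    r2 = (t2 / theta2) ** H
    return (k1 * r1 + k2 * r2) / (1.0 + r1 + r2)


if __name__ == "__main__":
    # Production rate of the target for a few regulator levels.
    for t1, t2 in [(5.0, 0.0), (0.0, 5.0), (5.0, 5.0)]:
        print(t1, t2,
              round(and_gate(t1, t2), 3),
              round(or_gate(t1, t2), 3),
              round(competitive_gate(t1, t2), 3))
```

With a single regulator present, the AND gate gives no production while both OR variants respond. With both regulators saturating, the noncompetitive OR rate approaches the sum of the two maximal rates, whereas the competitive form stays bounded by the larger of them.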


When dealing with PP interactions, the
picture is no simpler. In the simplest
case, of a three-node SOM in which
there is a TR interaction plus a PP
regulation, if both processes are activatory then
the TF will enhance the total amount
of substrate, $S_T = S + S_a$, while a second
regulator (now an enzyme) will activate
the substrate ($S \to S_a$) via, for instance, a
phosphorylation mechanism. On considering
the dephosphorylation to be spontaneous,
and to occur at a kinetic rate $k_5$,
the only meaningful choice is that of a
somewhat generalized OR gate, which can be
modeled as follows:


(24a) 


$$
\dot{S}_a = \frac{k_4 \, (S_T - S_a) \, E}{K_{M4} + (S_T - S_a)} - k_5 S_a \qquad (24b)
$$


Several possibilities exist when considering
two enzymes acting on a single
substrate. In the first case, the conformational
modifications may be independent,
with the two enzymes binding the substrate at
different domains, which again reproduces
OR logic. However, the modifications may
also be temporally consecutive, in which
case two consecutive reactions are necessary
to reach a final (active or inactive) state,
and the logic would be that of an AND gate.
But the possibilities do not end here; the
two enzymes might compete for a single,
common binding site, whereby cooperative
competition introduces, again, an essential
factor to be taken into account for modeling.
All of these cases can be modeled by extending
MM terms to each logical scheme,
as performed previously for TR combinations.
Certainly, the exhaustive mathematical
modeling of all possible kinetic
processes is beyond the objective of this
chapter. Nevertheless, if the aim is to
understand the typical dynamics of network
motifs (even as simple as SOMs),
then all of these considerations regarding
the types of regulation should be carefully
considered.

The dynamical modeling of the other
structure depicted in Fig. 1, the SIM, is
clearly easier. The interactions of a single
regulator with more than one substrate
are plausibly modeled as independent
interactions, even when a regulator is
responsible for hundreds of regulations,
under the hypothesis that the total
concentration of complexes (either TF-DNA
complexes or PP complexes in the case of
enzymes) is negligible compared to the total
amount of regulator in the cell. Even when
assuming this highly simplifying approximation,
experimental studies have shown that, in
certain SIMs for which the target genes encode
substrates on a single metabolic pathway,
the kinetic parameters are anything but
contingent [68]. Consequently, this device
operates as a type of “biochemical
pipeline” in which each product appears
exactly at the time it is required. The
apparent simplicity of the structure hides a
subtle, useful, and evolved device.


3.2.2 Oscillators, Clocks, and Bistable 
Switches: FB Dynamics 

Following the renowned investigations
conducted by Elowitz and Leibler [69] in 2000
on negative FB dynamics, and by Gardner
et al. [70] and Becskei et al. [71] on
positive FB dynamics, FB loops and their
rich behaviors as clocks, oscillators, and
bistable switches have been discussed on
many occasions [31, 33, 42, 72]. A FB loop
can be characterized topologically as a
directed, closed loop of regulations. When the
product of the signs of all the links involved
is negative, the structure is referred to as a
negative feedback (NFB) loop, whereas in the
opposite case it is a positive feedback (PFB)
loop. Starting with two-node FB loops, there
exist two possible (topological) combinations
that lead to PFBs: two proteins repressing
each other; and two proteins activating
each other. Yet, when differentiating
between the different types of regulation,
whether TR or PP, the number of possible
combinations rises to six.

With regards to their dynamical modeling,
the first issue relates to the type of
regulatory interaction that is being dealt with.
Consider two substances that interact with
each other. The first is a TF that has two
possible conformations, inactive ($T$) and
active ($T_a$). The second is an enzyme $E$,
the full activity of which arises spontaneously
just after translation, so that no differentiation
need be made between the enzyme’s active and
inactive forms. Initially, the transcription
of enzyme $E$ is activated by the presence of
a certain TF $S$, acting as an external signal.
The enzyme, in turn, catalyzes activation
of the inactive fraction $T$ via, for instance, a
phosphorylation mechanism. To close the
FB, the active form of the TF, $T_a$, enhances
the expression of $E$. In that case, the active
fraction of the TF, $T_a$, would coincide with
the phosphorylated form, and the enzyme
could be a kinase. Consider, for simplicity,
the dephosphorylation of the active form
$T_a$ as a process that does not require any
additional enzyme to occur. Finally, the
total amount of TF ($T_T = T + T_a$) could be
considered constant, although it certainly
may depend on other signaling inputs or
decay processes.

Fig. 2 (a-f) The six possible positive feedback
(PFB) loops. Note the differences between the
PP- and TR-motifs.

At this point, it should be noted that
many other possibilities clearly exist,
involving phosphatases, proteins of which the
active forms are not phosphorylated, and
other multiple catalytic mechanisms that
are not concerned with phosphate group
transfers. On examining Fig. 2, the system
previously described corresponds to
Fig. 2(b). If the TR interactions are modeled
by Hill terms and the PP interaction
as an MM process, then the system can be
described by the following system of rate
equations:


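$$
\dot{E} = k_1 \, \frac{S/\theta_1}{1 + S/\theta_1} + k_2 \, \frac{T_a/\theta_2}{1 + T_a/\theta_2} - k_3 E \qquad (25a)
$$

$$
\dot{T}_a = \frac{k_4 \, (T_T - T_a) \, E}{K_{M4} + (T_T - T_a)} - k_5 T_a \qquad (25b)
$$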


where $k_i$ are the mass action-derived rate
constants of the processes, $K_{Mi}$ are the
Michaelis constants associated with the
PP interactions, and $\theta_i$ are the parameters
that define the half-maximum values of the
Hill curves for TR terms. For simplicity,
any cooperative effects on TR interactions
have not been considered, and the MM
mechanisms are standard; hence, no multisite
phosphorylation processes or more
complex dynamical effects have been taken
into account. The binding of the TFs $S$ and
$T_a$ on the enzyme’s promoter defines a
noncompetitive OR logic gate.

On the other hand, in order to describe
a PFB composed of a mutual inhibition
scheme, the modifications required to
turn the system described in Eq. (25) (see
Fig. 2(b)) into its mate (as represented
in Fig. 2(e)) are twofold. First, $T_a$ must
inhibit the transcription of $E$ rather than
activate it. Second, if $E$ is still to be
considered as a kinase, then the active
form of the TF ($T_a$) will no longer be the
product of the reaction catalyzed by $E$, but
rather the substrate. Hence, the effect of
$E$ consists of turning $T_a$ off, as it is now
an inhibitor. In other words, in this case
the phosphorylated form is inactive, such
that the noncatalyzed dephosphorylation
turns the TF active. In summary,
the rate equations are:


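$$
\dot{E} = k_1 \, \frac{S/\theta_1}{1 + S/\theta_1} + k_2 \, \frac{1}{1 + T_a/\theta_2} - k_3 E \qquad (26a)
$$

$$
\dot{T}_a = -\frac{k_4 \, T_a \, E}{K_{M4} + T_a} + k_5 \, (T_T - T_a) \qquad (26b)
$$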


For a closely similar approach to PFB
modeling, see Ref. [31], whereas in Ref. [32]
PFBs are also modeled using empirical
models formally equivalent to Eq. (18).
The essential dynamical features of these
systems are discussed in these notable
reviews, as well as in many other texts (see
Ref. [33] for a general contextualization
and Ref. [72] for a more specific treatment
of bistability and FBs).
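
The effect of closing the positive loop can be illustrated numerically. The sketch below integrates a two-variable model of the kind described for Fig. 2(b), assuming a noncompetitive OR promoter without cooperativity and a standard MM activation step. The function name `simulate_pfb`, the forward-Euler scheme, and all parameter values are illustrative assumptions rather than values taken from the chapter.

```python
def simulate_pfb(S=1.0, k1=1.0, k2=1.0, k3=1.0, k4=1.0, k5=1.0,
                 theta1=1.0, theta2=1.0, KM4=1.0, TT=1.0,
                 dt=0.01, t_end=50.0):
    """Forward-Euler integration of an enzyme E and an active TF fraction Ta.

    E is transcribed under OR control of the signal S and of Ta;
    E converts inactive TF (TT - Ta) into Ta with MM kinetics;
    Ta reverts spontaneously at rate k5.
    """
    E, Ta = 0.0, 0.0
    for _ in range(round(t_end / dt)):
        dE = (k1 * (S / theta1) / (1.0 + S / theta1)
              + k2 * (Ta / theta2) / (1.0 + Ta / theta2)
              - k3 * E)
        dTa = k4 * (TT - Ta) * E / (KM4 + (TT - Ta)) - k5 * Ta
        E += dt * dE
        Ta += dt * dTa
    return E, Ta


if __name__ == "__main__":
    with_loop = simulate_pfb()          # feedback term present
    without_loop = simulate_pfb(k2=0.0)  # feedback term switched off
    print("with feedback   :", with_loop)
    print("without feedback:", without_loop)
```

Setting `k2=0` removes the feedback arm, so comparing the two runs shows how much of the stationary enzyme level is due to the loop. With these non-cooperative terms the sketch only shows amplification; whether bistability appears depends on the kinetic details discussed in the cited reviews.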

With regards to the other family of FB
structures, the NFB loops, on examining
two-node structures there is only one
solution in terms of signs: the activator
activates the inhibitor, which in turn
inhibits the activator. On again differentiating
between the possible combinations of the
different types of regulation, only four
possibilities are found, compared with the
six found for PFBs. These types of structure
offer two main dynamical regimes. On the one
hand, NFB loops can behave like homeostatic
devices that are capable of keeping the
stationary response within a narrow
window over wide ranges of signal
concentration. This type of homeostatic
regulation is most commonly employed in
biosynthetic pathways.

On the other hand, NFBs can demon- 
strate sustained oscillations, for which 
mathematical requirements have been 
characterized in detail [73, 74]. Within 
the generic context of a system of two
chemical species, $x_1$, $x_2$, with production
rates depending on them through functions
$\dot{x}_i = f_i(x_1, x_2)$, it has been proved
that, in general, when the production 
rates are monotonically increasing 
with the concentrations (i.e., the more 
concentrated is a chemical compound, 
the higher the rate of all reactions it 
participates in), the trajectories are always 
bounded in phase space. That situation 
leads to limit cycle-sustained oscillations 
when the steady state of the system 
is unstable. For these two-component 
systems it has been proved [75] that 
at least three chemical reactions are 
necessary, one of these being autocatalytic 
(i.e., the production rate of one substance 
must depend on its own concentration) 
and involving at least three molecules. 

Some of the biochemical conditions
yielding oscillatory behaviors were
enumerated by Elowitz and Leibler [69] by
using a repressilator system; this was
essentially a synthetic three-component NFB
cycle in which each of the proteins
repressed the next. The conditions identified
were the presence of strong promoters,
strong ribosome-binding sites, tight and
cooperative repression, and similar mRNA
and protein decay times. When the system
adequately fulfilled these conditions, the
stationary fixed point of its dynamics
became unstable such that limit-cycle
oscillations appeared. More recent investigations
have been targeted at identifying the
relationships that exist between the particular
topology of the FB loops driving the
processes, and the dynamical footprints
associated with these oscillatory, experimental
behaviors [76].
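
The role of cooperative repression in destabilizing the fixed point can be illustrated with a deliberately reduced model. The sketch below is an assumption-laden simplification, not the model of Ref. [69]: it keeps only the three protein concentrations (no mRNA), uses a dimensionless maximal production rate `alpha` and Hill exponent `n`, and integrates with forward Euler. All names and values are illustrative.

```python
def simulate_ring(alpha=10.0, n=4.0, dt=0.01, t_end=200.0):
    """Three-protein repression ring: protein i is repressed by protein i-1.

    dx_i/dt = alpha / (1 + x_{i-1}^n) - x_i   (dimensionless, protein-only)
    Returns the time course of the first protein.
    """
    x = [1.0, 1.5, 2.0]  # asymmetric start, away from the fixed point
    trace = []
    for _ in range(round(t_end / dt)):
        dx = [alpha / (1.0 + x[i - 1] ** n) - x[i] for i in range(3)]
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
        trace.append(x[0])
    return trace


def late_amplitude(trace):
    """Peak-to-trough amplitude over the second half of the run."""
    tail = trace[len(trace) // 2:]
    return max(tail) - min(tail)


if __name__ == "__main__":
    print("cooperative (n=4):    ", late_amplitude(simulate_ring(n=4.0)))
    print("non-cooperative (n=1):", late_amplitude(simulate_ring(n=1.0)))
```

In this reduced model, linearizing around the symmetric fixed point shows that it loses stability only when the repression is sufficiently steep (which requires `n > 2`), so the cooperative run settles on sustained oscillations while the non-cooperative run relaxes to a steady state.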

The NFB loop has been proposed as
an architecture adapted to sustain
oscillations in periodic series of biochemical
events, such as cell cycles [37, 77, 78] and
circadian rhythms [36, 79, 80]. In these
studies (as reviewed by Tyson and Alon
[31, 33]), the more sophisticated dynamical
coupling of PFB and NFB loops was
examined in greater detail. Beyond these
considerations, it should be noted that
certain questions regarding FB loops remain
poorly studied, notably the explicit
dynamical implications of FBs that involve
PP interactions rather than purely
transcriptional structures. Nonetheless, the
fact that purely transcriptional FBs are
much more common in developmental
TR networks than in sensory TR networks
suggests that the dynamics of these
structures may be more useful in the former
situation [37].


3.2.3 FFLs: Noise Management and Pulse
Generation 

The final structure to be discussed is the
FFL motif, which is composed of two
regulators, $R_1$ and $R_2$, acting on a common
substrate, $S$. One of these regulators, say
$R_1$, regulates in turn the activity of the
other regulator, $R_2$. Depending on the sign
of the regulations, the FFLs have been
subdivided into two groups [81], namely
coherent FFLs and incoherent FFLs. In coherent
FFLs, the sign of the indirect substrate
regulation $R_1 \to R_2 \to S$ coincides with
the sign of the direct regulation $R_1 \to S$,
but for incoherent FFLs the signs of both
paths are opposite. By adhering only to this
sign criterion, there exist four different
combinations of incoherent FFLs and four
further possibilities for coherent motifs.

There is, however, another source of
multiplicity, namely the possibility of
dealing either with TR motifs or with
PP interactions. By considering that both
regulations due to $R_1$ are of the same type,
the possibilities rise to 32 different FFLs;
by further considering the different logical
implementations for the convergent
regulations on $S$, the possibilities amount
to 48 (a PP regulation and a TR regulation
cannot define an AND gate). Finally,
if self-inhibitions in the regulators are
admitted, and different sensing combinations
are considered (i.e., only $R_1$ can sense
an external signal or instead both regulators
can sense one signal each), then the
number of combinations might reach 384.

The statistics of motif significance have
shown that FFLs are generally
overrepresented [42]. In an attempt to illustrate
the main cases of all possible FFLs, three
notable examples have been selected, for
different reasons: the so-called coherent
type 1 FFL (Fig. 3a); its cognate incoherent
type 1 (Fig. 3b); and an example of a
non purely transcriptional FFL (Fig. 3c).

The three interactions involved in the
coherent type 1 FFL are activations. So, for an
AND gate implementation, and if considering
both regulators as TFs (denoting the
respective concentrations as $T_1$ and $T_2$), its
behavior can be modeled as:


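$$
\dot{T}_{1a} = k_1 \, \frac{S_0 \, (T_{1T} - T_{1a})}{K_{M0} + (T_{1T} - T_{1a})} - k_2 T_{1a} \qquad (27a)
$$

$$
\dot{T}_2 = k_3 \, \frac{(T_{1a}/\theta_1)^H}{1 + (T_{1a}/\theta_1)^H} - k_4 T_2 \qquad (27b)
$$

$$
\dot{S} = k_5 \, \frac{(T_{1a}/\theta_2)^H}{1 + (T_{1a}/\theta_2)^H} \, \frac{(T_2/\theta_3)^H}{1 + (T_2/\theta_3)^H} - k_6 S \qquad (27c)
$$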


The dynamics driven by this scheme
has, as its main feature, a noise-filtering
ability generated by sign-sensitive delays
[81]. The scheme is simple: when the
activation of $T_1$ via the signal $S_0$ occurs,
the transcription of $T_2$ starts. With regards
to the substrate $S$, as its promoter activity
is governed by an AND combination of
both regulators, it is not immediately
transcribed, as a certain minimum level
for the concentration of $T_2$ must be
reached. Only if the signal is persistent
enough will this activation threshold in $T_2$
concentration be reached, and $S$ will then
begin to be expressed. If the signal suddenly
shuts off, the activity of $S$ will fall
immediately as $T_2$ is also diluted rapidly
after signal removal (see Fig. 4).
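
The filtering of short input pulses by a coherent type 1 FFL with AND logic can be checked with a small simulation. The sketch below assumes the three-variable structure described above (signal-driven activation of the first TF, transcription of the second TF, AND-gated production of the target). The function names, the forward-Euler scheme, the single shared threshold, and every parameter value are illustrative assumptions; in particular, the first TF is assumed to switch on and off much faster than the other two species.

```python
def hill(x, theta, H):
    """Activating Hill term (x/theta)^H / (1 + (x/theta)^H)."""
    r = (x / theta) ** H
    return r / (1.0 + r)


def peak_output(pulse_length, dt=0.001, t_end=30.0):
    """Peak level of the target S for a square input pulse of given length."""
    k1, KM0, k2 = 100.0, 0.5, 10.0  # fast activation / deactivation of T1
    k3, k4 = 1.0, 1.0               # production / decay of T2
    k5, k6 = 1.0, 1.0               # production / decay of S
    theta, H, T1T = 0.5, 2.0, 1.0   # shared threshold, Hill exponent, total T1
    T1a = T2 = S = peak = 0.0
    for step in range(round(t_end / dt)):
        S0 = 1.0 if step * dt < pulse_length else 0.0
        dT1a = k1 * S0 * (T1T - T1a) / (KM0 + (T1T - T1a)) - k2 * T1a
        dT2 = k3 * hill(T1a, theta, H) - k4 * T2
        dS = k5 * hill(T1a, theta, H) * hill(T2, theta, H) - k6 * S
        T1a += dt * dT1a
        T2 += dt * dT2
        S += dt * dS
        peak = max(peak, S)
    return peak


if __name__ == "__main__":
    print("short pulse (0.5):", peak_output(0.5))
    print("long pulse (20):  ", peak_output(20.0))
```

A pulse that is short compared with the accumulation time of the second TF produces only a small response, whereas a persistent signal drives the target close to its stationary level.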

For incoherent type 1 FFLs, only the 
regulation of S by T2 is an inhibition. 
By combining again the two incoming 
regulations on S with an AND logic, only 
the rate equation for S will be changed to: 


$$
\dot{S} = k_5 \, \frac{(T_{1a}/\theta_2)^H}{1 + (T_{1a}/\theta_2)^H} \, \frac{1}{1 + (T_2/\theta_3)^H} - k_6 S \qquad (28)
$$


In this case, $T_1$ would act as an
activator affected by the co-inhibitor $T_2$
that, when present, would invert the
regulation sign from activation to
inhibition. The dynamics of this system
has two main characteristics: accelerating
responses, and generating pulses.
These dynamical features have been
addressed experimentally in E. coli
synthetic [39] and natural [40] regulatory
systems. As highlighted in Ref. [33],
the response-accelerating performance of
this type of structure can be considered
more important than that of negative
self-inhibition, as the latter can only
accelerate the production of its own regulator
and genes within its operon, whereas FFLs
can drive the acceleration of the activity at
any operon.

Fig. 3 (a-c) The three motifs receive an
external input parametrized by the presence
of a certain signaling enzymatic regulator, $S_0$.

Fig. 4 The term sign-sensitive delay refers to the fact that
the delay appears only as a consequence of a positive
signal stimulus, but not after a negative change in signal
concentration.
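
Pulse generation by an incoherent type 1 FFL can be sketched in the same way. The code below assumes a persistent input, fast activation of the first TF, and AND-type combination of an activating Hill term in the first TF with a repressing term in the second. The function names, the forward-Euler scheme, and all parameter values (including the low repression threshold `theta_rep`) are illustrative assumptions.

```python
def hill(x, theta, H):
    """Activating Hill term (x/theta)^H / (1 + (x/theta)^H)."""
    r = (x / theta) ** H
    return r / (1.0 + r)


def step_response(dt=0.001, t_end=30.0):
    """Return (peak, final) level of the target S after a step in the input."""
    k1, KM0, k2 = 100.0, 0.5, 10.0   # fast activation / deactivation of T1
    k3, k4 = 1.0, 1.0                # production / decay of T2
    k5, k6 = 1.0, 1.0                # production / decay of S
    theta_act, theta_rep = 0.5, 0.2  # activation and repression thresholds
    H, T1T, S0 = 2.0, 1.0, 1.0       # Hill exponent, total T1, input level
    T1a = T2 = S = peak = 0.0
    for _ in range(round(t_end / dt)):
        dT1a = k1 * S0 * (T1T - T1a) / (KM0 + (T1T - T1a)) - k2 * T1a
        dT2 = k3 * hill(T1a, theta_act, H) - k4 * T2
        repression = 1.0 / (1.0 + (T2 / theta_rep) ** H)
        dS = k5 * hill(T1a, theta_act, H) * repression - k6 * S
        T1a += dt * dT1a
        T2 += dt * dT2
        S += dt * dS
        peak = max(peak, S)
    return peak, S


if __name__ == "__main__":
    peak, final = step_response()
    print("peak:", peak, "final:", final)
```

The target rises quickly while the repressor is still scarce and then falls back to a much lower stationary level once the repressor has accumulated, which is the pulse-like behavior described above.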
Finally, mention should be made
of a recent investigation conducted
by Csikasz-Nagy et al. [51], who
studied cell-cycle regulation by the
cyclin-dependent kinase Cdk1 in budding
yeast. In this case, it was established
that many (more than expected at random)
of the proteins involved in cell-cycle
regulation, the periodic expression of which
was controlled by Cdk1, were also controlled
by TFs, following a pattern that (irrespective
of the signs) was precisely that of the
FFL represented in Fig. 3c. Although
the dynamical models proposed in Ref.
[51] were simple, the studies were novel
in terms of integrating TR data with PP
interactions at a genome-proteome-wide
level. As noted repeatedly above, the
dynamics of these small modules is not
always easy to anticipate, and studies of
the dynamical behavior of mixed motifs
may lead to more than one surprise!


3.3 
Perspectives 


This review of the main features of the 
Boolean network approach to modeling 
gene regulatory networks has not been ex- 
haustive, with only selected topics having 
been presented. Of future interest would 
be an application of the Boolean network 
framework to the predictive modeling of 
real systems. For example, from a practical 
perspective it would be intriguing to deter- 
mine whether the network could be forced 
to switch from one attractor to another,
or whether the long-term behavior of the network’s
dynamics could be altered. In a disease 
such as cancer, for instance, the possi- 
bility of pushing tumor cells towards the 
apoptosis attractor would surely be of great 
therapeutic interest. 

With the continuous approaches much 
caution must be applied, and the gen- 
eral proviso that what is true for purely 
TR motifs may not be always true for 
general regulatory structures should be 
stressed. Nonetheless, as has been empha- 
sized above, there exist many more ways to 
regulate protein activities than transcrip- 
tional regulations, and their integration at 
a greater scale into a common framework 
is surely a task towards which much effort 
will be devoted in the near future. 

In contrast, the motif approach to reg- 
ulatory networks modeling relies on the 
fact that the true temporal profiles of pro- 
tein concentrations may — in a significant 
proportion of cases — obey the dynamical 
performances of small regulatory motifs 
constituted only by the protein involved 
and a reduced number of regulators that 
affect it. Yet, the question remains as to 
whether motifs represent very general be- 
haviors, or whether they represent scarce, 
cunningly chosen examples that are not
susceptible to being generalized.

Keywords


E-cell 
Short form for electronic cell: a tool for modeling and simulating cellular pathways. 


Self-supporting cell 
A virtual cell with 127 genes sufficient for survival. 


In-silico modeling 
A model building process using computers. 


Mycoplasma genitalium 
A microorganism with the smallest number of genes. 


Virtual erythrocyte 
An in-silico model of whole erythrocytic metabolism. 


Modeling 
Abstracting and connecting the elements of a system. 


Simulation 
A process of visualizing a model dynamically. 


Cells are massively parallel and massively interactive systems. The grand challenge 
is to understand their structural and functional design and to use the knowledge 
acquired to build useful applications. In traditional settings, it was difficult to 
focus on more than one gene at a time, but recently developed high-throughput 
technologies have enabled studies to be conducted at the whole organism level. 
Nevertheless, data from these experiments are often noisy and require a large 
number of replicates to validate even a single observation. Furthermore, the 
statistical treatment of high-throughput data is also error-prone. To overcome the 
physical and conceptual limitations, there is a need to develop strategies and tools 
to address complex biological problems. Systems Biology studies conducted during 
the past decade have lent credibility to an in-silico approach for understanding 
and engineering whole-cell systems. The E-Cell platform has been specifically
designed to address network-based problems. E-Cell has been used successfully
to create a self-sustaining cell with 127 genes, that is, just stable enough for
survival. In this chapter, some of the basic modeling concepts, their importance,
the role of E-Cell, and the future challenges of the modeling community will be
discussed.


1 
Introduction 


A model is a representation of a system,
and reflects a combination of hypotheses,
evidence, and abstractions. Models
are the closest replicas of actual phenomena
with diagnostic and predictive abilities.
Models should be easily understood,
controllable, and analyzable for large and
complex data. The aim of whole-cell modeling
is to provide both a conceptual basis
and a working methodology for studying
the cell in its entirety, to reproduce
existing knowledge, identify unknown entities,
make predictions, and design experiments
to address unanswered questions.

A cell by itself is a complete genetic and 
biochemical reactor holding all the infor- 
mation necessary to sustain life. It offers 
an ideal middle path between (extreme 
ends of) atomic interactions and whole 
organs. By creating a whole-cell model, it 
is theoretically possible to stretch out data 
and hypothesis in either direction. Exper- 
imental biology has now reached a stage 
where data analysis and interpretation are 
heavily dependent on in-silico approaches. 
Although the static representation of data
has traditionally helped develop an overall
perspective, dynamic modeling aids in better
understanding of the cellular decisions
(Table 1).

Tab. 1 Basics of map construction.

Type of representation | Description
Linear chain | Unidirectional flow
Branched chain | Two enzymes participate in one reaction, resulting in different products
Loops | Two branches unite, forming an inherent dependency between them
Cycles | Larger loops comprising many intermediates

Broadly, whole-cell transactions
may be classified into enzymatic and
nonenzymatic processes. Enzymatic
processes represent most of the metabolic
events, while nonenzymatic processes
represent gene expression and regulation,
signal transduction, and diffusion. In
order to create a complete virtual cell,
it is important to have provisions for
DNA replication and repair, transcription
and its regulation, translation, energy
metabolism, metabolism, cell division,
signaling pathways, cell membrane
dynamics (ion channels, pumps, nutrients),
and intracellular molecular
trafficking with appropriate mathematical
representations (Table 2).

Tab. 2 Mathematical representation of cellular processes.

Process | Dominant phenomenon | Typical computational scheme(s)
Metabolism | Enzymatic reaction | DAE, S-Systems, FBA
Signal transduction | Molecular binding | DAE, stochastic, diffusion reaction
Gene expression | Molecular binding, polymerization, degradation | OOM, S-Systems, DAE, Boolean, stochastic, Bayesian, rule based
DNA replication | Molecular binding, polymerization | OOM, DAE
Cytoskeletal | Polymerization, depolymerization | DAE, particle dynamics
Cytoplasmic streaming | Streaming | Rheology, finite element method
Membrane transport | Osmotic pressure, membrane potential | DAE, electrophysiology

DAE: Differential algebraic equations; FBA: Flux balance analysis; OOM: Object-Oriented Modeling.

To accomplish the “big picture,” the data
should not only be of good quality but also
should be treated with intuitive mathematical
representations that accurately describe
life in vivo. It is worth mentioning
that good data is more of an exception
than the rule! For modeling metabolic
pathways, the data input typically consists
of rate constants and concentrations. A
metabolic pathway usually consists of forward
and reverse reactions (uni-, bi-, ter-)
of ordered/random types. The inhibitors
may be intermediate compounds of the
same pathway or external entities. The
availability of good data makes the modeling
process more or less straightforward,
but often missing links must be
identified due to incompleteness of information.
Problems in doing so arise mostly
for numerical reasons: stiffness and
parameter sensitivity. The main difference
between data-to-model and model-to-data
approaches is that, in the former case, the
starting materials are substrate, enzyme,
and modifier concentrations, while in
the latter case the kinetic constants and
reaction velocities are assumed. However,
the difference between these two approaches
sometimes blurs, because in
real-life situations modeling often involves
manual data-fitting approaches to match
an expected output or hypothesis. The major
advantage of carrying out simulations
is not only to study the system per se,
but also to extrapolate its behavior in the
presence of a hypothetical condition, for
example a cell with many essential gene
knockouts. In addition to demystifying
nonintuitive phenomena, simulation allows
the testing of experimentally unfeasible
scenarios and reduces experimental
costs. Although wet experiments are
indispensable for the advancement of biological
knowledge, in-silico modeling can
help to shorten knowledge discovery. With
the enormous computational power easily
available today, the challenging part
in modeling is more conceptual than
physical.

An overview of modeling tools, online 
resources, and databases is provided in 
the next section. 


2 
Biological Modeling and Simulation Tools 


A number of promising tools are available 
for studying gene expression, regulation, 
and metabolic pathways (Tables 3 and 4). 
Listed below are the partial descriptions of 
a few such tools. 


DBsolve: URL: http://websites.ntl.com/~igor.goryanin/.
DBsolve is an integrated
development environment for metabolic, 
enzymatic, and receptor—ligand binding 
simulation. It is an ordinary differential 
equation (ODE)-based tool that also in- 
corporates the stoichiometry of chemical 
reactions. The main strength of DBsolve is 
the calculation of steady state, fitting, and 
optimization options. 


Gepasi: URL: http://www.gepasi.org/. 
Gepasi simulates the steady-state and 
time-course behavior of reactions over 
time and space, based on stoichiometry 
and reaction kinetics values. The program 
is based on ODEs. It is a very useful tool 
for conducting metabolic control analysis 
and linear kinetic stability analysis leading 


Tab. 3. Cellular databases and pathways. 
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Name URL 


1. General Online Maps and Pathways 
IUBMB-Nicholson Minimaps http://www.tcd. 
ie/Biochemistry/IUBMB-Nicholson/ 


Boehringer Mannheim biochemical pathways 
http: //www.expasy.ch/cgi-bin/search-biochem- 
index 

Kyoto Encyclopedia of Genes and Genomes 
(KEGG) http://www.genome.ad.jp/kegg/ 


What Is There (WIT) 
http://wit.mcs.anl.gov/WIT2/ 

Enzyme and Metabolic Pathway (EMP) 
http://emp.mcs.anl.gov/ 


Biopathways Consortium 
http://www. biopathways.org/ 
EcoCyc http://ecocyc.org/ 


PathDB 


http://www.ncgr.org/pathdb/ 
http://umbbd.ahc.umn.edu/ 
METAVISTA http://www.metabolic-explorer.com 


2. Regulatory pathways 

KEGG regulatory pathways 
http://www.genome.ad.jp/kegg/regulation.html 

BioCarta http://www.biocarta.com/ 


Biomolecular Interaction Network Database 
(BIND) http://www.bind.ca/index.phtml 


Signal Pathway Database (SPAD) 
http: //www.grt.kyushu-u.ac.jp/spad/ 


Cell Signaling Networks Database (CSNDB) 
http://geo.nihs.go.jp/csndb/ 


Features 


Comprehensive; describes regulatory and 
spatial features of substrates and 
enzymes; available in .gif, .svg, and .pdf 
forms 

Comprehensive; covers many organisms, 
most extensively used by researchers, 
available in online and paper formats 

Huge database on gene sequence, 
regulatory pathways, metabolism, 
molecular assemblies, and so on 

Covers metabolic pathways of over 25 
organisms 

Includes metabolic pathways, reaction 
mechanisms, rate laws, and numeric 
data from research reports 

An open forum for developing technologies 
and standards for biopathways 

Houses all the Escherichia coli pathways 
with an aim of creating its functional 
catalog 

Plant metabolic database management 
system; runs on client server architecture 


Resource for proteomic profiling, metabolic 
profiling, and metabolic flux analysis 


An extension of KEGG database 


Interactive web-based resource on gene 
function and proteomics 

Describes chemical reactions, 
conformational changes, and protein and 
network interactions across various 
species. 

Signal transduction database with 
emphasis on protein-protein and 
protein—DNA interactions 

Contains sequences, structures, functions, 
and reactions involved in cell signaling 


Tab. 3 (continued)

Name | URL | Features
Munich Information Centre for Protein Sequences (MIPS) | http://mips.gsf.de/proj/yeast/CYGD/db/index.html | Functional yeast genomic database
GeNet (Gene Networks Database) | http://www.csa.ru/Inst/gorb_dep/inbios/genet/genet.htm | Developmentally regulated gene networks
EmbryoNet | http://www.csa.ru/Inst/gorb_dep/inbios/genet/embryo.htm | Developmentally regulated genetic networks
Genetic network maps | http://www.csa.ru/Inst/gorb_dep/inbios/genet/access.htm | Drosophila embryogenesis network
Wnt signaling pathway | http://www.stanford.edu/~rnusse/wntwindow.html | Drosophila developmentally regulated Wnt signaling pathways

3. Transcription factors and expression
TRANSPATH | http://transpath.gbf.de/ | Describes pathways involved in regulation of transcription factors
TRANSFAC | http://transfac.gbf.de/TRANSFAC/index.html | Transcription factor database
RegulonDB | http://www.cifn.unam.mx/Computational_Genomics/regulondb/ | Transcription regulation and operon organization database
DBTBS | http://elmo.ims.u-tokyo.ac.jp/dbtbs/ | Bacillus subtilis promoter and transcription factor database
Saccharomyces cerevisiae Promoter Database (SCPD) | http://cgsigma.cshl.org/jian/ | Database on promoters and mapped regulatory regions of yeast
Axeldb | http://www.dkfz-heidelberg.de/abt0135/axeldb.htm | Gene expression database of Xenopus laevis
NEXTDB | http://nematode.lab.nig.ac.jp/ | Caenorhabditis elegans expression database
MAGEST | http://www.genome.ad.jp/magest/about.html | Ascidian expression database

4. Enzyme database
BRENDA | http://www.brenda.uni-koeln.de/ | The most comprehensive database on biochemical reactions
ExPASy | http://www.expasy.ch/ | Database on protein sequences and structures
NC-IUBMB | http://www.chem.qmw.ac.uk/iubmb/enzyme/ | Enzyme nomenclature database
Ligand chemical database | http://www.genome.ad.jp/dbget-bin/www_bfind?ligand | Database of chemical compounds and reactions in biological systems
NIST Thermodynamic database | http://wwwbmcd.nist.gov:8080/enzyme/enzyme.html | Repository on thermodynamics of enzyme-catalyzed reactions
PROCAT | http://www.biochem.ucl.ac.uk/bsm/PROCAT/PROCAT.html | Specialized for 3-D active site templates of enzymes

5. Scientific literature search
PubMed | http://www.ncbi.nlm.nih.gov/entrez/query.fcgi | Huge repository of biomedical literature
Medline | http://research.bmn.com/medline/search | The oldest and most comprehensive biomedical literature database
Scirus | http://www.scirus.com | Extremely useful metasearch tool developed by Elsevier Science Publishers


Tab. 4 Tools for drawing pathways.

Name of the tool | URL | Target application
Pathfinder | http://bibiserv.techfak.unibielefeld.de/pathfinder/ | Dynamic visualization of metabolic pathways represented as acyclic graphs
Electric arc | http://home.xnet.com/~selkovjr/ElectricArc/ | CAD-based; can be used to design anything from abstract graphs to electronic circuits
Biopath | http://biopath.fmi-uni-passau.de/index.html | Used for digitizing Boehringer Biochemical Pathways
Pathway browser | http://www-pr.informatik.uni-tuebingen.de/~eiglsper/pathways/ | Visualization tool, XML-based, requires Java


to determination of the steady state of a 
system. 


Jarnac: URL: http://www.cds.caltech.edu/~hsauro/Jarnac.htm. Jarnac is a cell-modeling language for describing metabolic, signal transduction, and gene networks. It is linked to JDesigner, with which the user interacts when modeling a biochemical event.


Virtual Cell: URL: http://www.nrcam.uchc.edu/. Virtual Cell is a modeling tool that associates biochemical and electrophysiological data with microscopic image data describing subcellular locations. It is based on a strong mathematical foundation, and the results can be analyzed as images. Access to the Virtual Cell modeling software is via the Internet using a Java-based interface.


A-Cell: URL: http://www.fujixerox.co.jp/crc/cng/A-Cell/. A-Cell is a Windows-based graphical user interface (GUI) for the construction of biochemical reaction models. In addition, it has the capability of importing previously constructed models and combining them with the system.


BioQUEST: URL: http://omega.cc.umb.edu/~bwhite/ek.html. BioQUEST is a set of building blocks that run on the numerical simulation program “Extend,” allowing the user to construct and conduct time-series analyses of biochemical reactions.


Dynafit: URL: http://www.biokin.com/. Dynafit is software for simulating biochemical reactions. It also provides ViraFit for analysis of hepatitis C viral data, Batch Ki (a client-server tool) for the determination of tight-binding enzyme inhibition constants, and Plate Ki, which is similar to Batch Ki but runs as a stand-alone application.


ModelMaker: URL: http://www.modelkinetix.com/. ModelMaker allows the modeling of continuous and discontinuous functions and stiff and stochastic systems. It also provides optimization, minimization, Monte Carlo, and sensitivity analysis.


MetaModel: URL: http://bip.cnrs-mrs.fr/bip10/modeling.htm. MetaModel 3.0 is a DOS-based program for simulating simple biochemical reactions.


DMSS: URL: http://www.bio.cam.ac.uk/~mw263/ftp/doc/ISMB99.ps. The Discrete Metabolic Simulation System (DMSS) does not employ kinetic parameters, stoichiometry matrices, or flux coefficients. Instead, the rate of a reaction is modeled on the basis of competing metabolite concentrations or metabolite affinities to enzymes, including metabolite and enzyme concentrations.


E-Cell: URL: http://www.e-cell.org. E-Cell is a modeling and simulation environment. The basic concepts and applications of the E-Cell system are detailed in the following sections.


3 
The E-Cell System 


3.1 
Introduction 


The raw material for biological complexity 
is an immense diversity in components 
and rules that are employed to create 
and sustain life. In order to understand 
the underlying complexity and engineer 
new systems, it is necessary to create an 
environment that can translate biology at 
the level of mathematics. During the early 
1990s, the concept of in-silico biology had 
just begun to appear on the horizon, but 
the scientific community was waiting for 
proof of the concept. Hence, because no 
system existed at that time, the decision 
was taken to create one and, after many 
trials and errors, a “virtual baby” called
E-Cell was born. 

E-Cell is short for “Electronic Cell,” a generic object-oriented environment for modeling and simulating molecular processes of the whole cell in user-definable models, equipped with graphical interfaces that allow observation and interaction. The E-Cell modeling approach links diverse cellular processes such as gene expression, signaling, and metabolism, to form a virtual cell framework. By using E-Cell, it is possible to create a model and also to translate this model into a simulation environment through mathematical equations. More precisely, however, it is a generic system for constructing object models of cells that can (optionally) emulate the behavior of numeric equation solvers.


The E-Cell project was started at Keio University, Japan, in October 1996. The first working version of E-Cell was ready within three months, and the first virtual cell (Mycoplasma genitalium) was developed within a year. In March 2001, the beta version 1.0 of the software was publicly released under open source, and new GUI and peripheral software tools were added. Python is an ideal language for the other user-side components, where productivity and readability are demanded (Table 5). E-Cell is an open-source project: the entire documentation with source code is available from http://www.bioinformatics.org/E-Cell/. Bioinformatics.org is a nonprofit, academic-based organization committed to opening access to bioinformatics research projects. The publicly available mailing lists are listed in Table 6. As E-Cell is an open-source project, the expected third-party contributions include algorithm modules, GUIs, new language bindings, and mathematical analysis modules. E-Cell (version 3.0) is highly modularized software that can easily be extended by writing plug-ins.


Tab. 5 E-Cell versions.

E-Cell 1.0 (Linux) | Entire E-Cell application, including GUI, written in C++; peripheral programs (er2eri, ss2er, rd2ch, etc.) written in Perl, Python, and yacc/lex
E-Cell 2.0 (Windows) | Uses a cocktail of C++, Perl, Cygwin, and Java
E-Cell 3.0 (Linux) | Core portions (libecs, libemc) and simulation objects (reactors, substances, steppers) written in C++; most other components, including front end, written in Python


Tab. 6 Public E-Cell mailing lists.

Mailing list | Features | Address
E-Cell announce | Very low traffic moderated ML for announcements regarding E-Cell projects | e-cell-announce@e-cell.org
E-Cell users | For free discussions on E-Cell | e-cell-users@e-cell.org; http://www.e-cell.org/mailman/listinfo/e-cell-users; http://www.e-cell.org/moin/moin.cgi
E-Cell development | For developing E-Cell 3.0 | e-cell3-devel@bioinformatics.org

ML = mailing list.
3.2
Architecture of E-Cell


The description here mainly refers to E-Cell version 1.0. The E-Cell software constructs object models equivalent to a cell system, or a part of a cell system, employing a structured Substance-Reactor Model (SRM). In the SRM, the objects belong to one of three fundamental object types (Primitives): Substance, Reactor, or System:


• The Substances represent amounts of a molecular species or other state variables.

• The Reactors represent cellular phenomena that result in a change in the amount or value of the molecular species or the state variables.

• The Systems are used as containers for the Primitives, representing functional and/or physical compartments.
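
The three Primitives can be pictured with a minimal Python sketch. This is not E-Cell code: the class layout, the mass-action rate law, and the explicit Euler update are all illustrative assumptions chosen only to show how Substances, Reactors, and Systems relate to one another.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Substance:
    """Amount of a molecular species or other state variable."""
    name: str
    quantity: float


@dataclass
class Reactor:
    """A process that changes Substance quantities.

    `stoichiometry` maps a substance name to its coefficient
    (negative = consumed, positive = produced).
    """
    name: str
    stoichiometry: Dict[str, int]
    rate_constant: float

    def velocity(self, substances: Dict[str, Substance]) -> float:
        # Assumed rate law (mass action): k * product of reactant quantities.
        v = self.rate_constant
        for sname, coeff in self.stoichiometry.items():
            if coeff < 0:
                v *= substances[sname].quantity ** (-coeff)
        return v


@dataclass
class System:
    """Container (compartment) holding Substances and Reactors."""
    name: str
    substances: Dict[str, Substance] = field(default_factory=dict)
    reactors: List[Reactor] = field(default_factory=list)

    def step(self, dt: float) -> None:
        # Compute every velocity first, then apply the changes (explicit Euler).
        velocities = [r.velocity(self.substances) for r in self.reactors]
        for reactor, v in zip(self.reactors, velocities):
            for sname, coeff in reactor.stoichiometry.items():
                self.substances[sname].quantity += coeff * v * dt


# Hypothetical one-reaction model: S -> P inside a "cytoplasm" System.
cytoplasm = System("cytoplasm")
cytoplasm.substances = {"S": Substance("S", 100.0), "P": Substance("P", 0.0)}
cytoplasm.reactors.append(Reactor("S_to_P", {"S": -1, "P": 1}, rate_constant=0.1))
cytoplasm.step(dt=1.0)
```

After the single step, ten units of S have been converted to P, because the assumed velocity is 0.1 × 100 per unit time.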


In E-Cell 1.0, a cell model description is composed of two parts: the definition of subclasses of the Primitives (mainly of the Reactor); and a rule file. The rule file contains: (1) a list of the three Primitive objects in the model; (2) relationships among the objects (e.g., stoichiometry of reactions); and (3) parameter values for the objects (e.g., rate constants).
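
The three kinds of information held in a rule file can be illustrated with a small Python structure. The layout below is hypothetical and is not the actual rule-file syntax; the names (`hexokinase`, `rate_constant`, and so on) are placeholders, and the helper only checks that the three parts are mutually consistent.

```python
# Hypothetical, simplified stand-in for the three parts of a rule file.
rule = {
    # (1) list of the Primitive objects in the model
    "objects": {
        "systems": ["cell"],
        "substances": ["glucose", "G6P", "ATP", "ADP"],
        "reactors": ["hexokinase"],
    },
    # (2) relationships among the objects (stoichiometry)
    "relationships": {
        "hexokinase": {"glucose": -1, "ATP": -1, "G6P": 1, "ADP": 1},
    },
    # (3) parameter values for the objects
    "parameters": {
        "hexokinase": {"rate_constant": 0.5},
    },
}


def validate(rule):
    """Return a list of inconsistencies found in the rule description."""
    problems = []
    known = set(rule["objects"]["substances"])
    for reactor in rule["objects"]["reactors"]:
        stoich = rule["relationships"].get(reactor)
        if stoich is None:
            problems.append(f"{reactor}: no stoichiometry")
            continue
        for substance in stoich:
            if substance not in known:
                problems.append(f"{reactor}: unknown substance {substance}")
        if reactor not in rule["parameters"]:
            problems.append(f"{reactor}: no parameters")
    return problems


problems = validate(rule)  # empty list for the consistent example above
```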


3.2.1 Elements of the Control Panel 
The control panel includes the following 
elements: 


• The substance window shows the quantity of a selected substance. It also allows the user to alter the quantity at will during the simulation process.

• The reactor window displays the activity of a selected reaction. The activity of a reaction is defined as the amount of product produced per second in the reaction process.

• Tracers are windows that plot the concentration of substances with time (Fig. 1).


3.2.2 Elements of the E-Cell Model 

In the E-Cell system, a substance is a substrate, product, catalyst, or ion that affects a reaction. Typically, substances include proteins, protein complexes, DNA (genes), RNA, and small molecules. The total number of molecules involved in a reaction is defined as quantity, while concentration describes the amount of substance present in a reaction space (in moles per liter). In E-Cell, the quantity of a substance equals Avogadro’s number × concentration × volume. The E-Cell simulation software uses the number of molecules in a sample to trace a reaction, and automatically converts concentration into quantity. The spreadsheet data file must be converted to .er text format. It is possible to use macros in the .er file to model complicated systems with ease. In E-Cell 3.0, the .er and .eri file formats are no longer used; rather, an XML-based E-Cell model description language is used. The E-Cell system extracts quantitative information from the rule file, links it with the equations described in reactors, and plots the velocity curve on tracers.

Fig. 1 Elements of the control panel (labeled elements: pulldown menu, step size button, time counter, file name area, start button).
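
The relation between quantity and concentration stated in this section can be written out as a short Python sketch. The function names are illustrative only, and the example volume (one femtoliter) is an assumed value, not one taken from an E-Cell model.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def concentration_to_quantity(conc_molar, volume_liters):
    """quantity = Avogadro's number x concentration x volume."""
    return AVOGADRO * conc_molar * volume_liters


def quantity_to_concentration(quantity, volume_liters):
    """Inverse conversion: molecules back to moles per liter."""
    return quantity / (AVOGADRO * volume_liters)


# Assumed example: 1 micromolar of a substance in a 1 fL (1e-15 L) compartment.
n_molecules = concentration_to_quantity(1e-6, 1e-15)   # about 602 molecules
back_to_molar = quantity_to_concentration(n_molecules, 1e-15)
```

The example shows why tracking molecule numbers matters at cellular volumes: a micromolar concentration corresponds to only a few hundred molecules.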

A special characteristic of E-Cell is the 
accumulator. The Reserve Accumulator 
(the default feature) is used when decimal 
fractions are unimportant, for example, 
when representing an individual cell. 
However, the Simple Accumulator is used 
in situations in which the floating-point 
value is crucial to the interpretation of 
the results, for example, if the number of 
molecules is very large or if the simulation 
represents an “average cell” among a
large number of cells. The Monte Carlo 
Accumulator is used if the simulation 
requires a high degree of precision in 
statistical analysis. 
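
One way to picture the difference between the three accumulators is the Python sketch below. The text above describes only when each accumulator is used; the internal behavior shown here (carrying the fractional remainder in a reserve, keeping a floating-point value, or rounding the fraction probabilistically) is an assumption made for illustration and is not taken from the E-Cell source.

```python
import math
import random


class SimpleAccumulator:
    """Assumed behavior: keeps the quantity as a floating-point value."""
    def __init__(self, quantity):
        self.quantity = float(quantity)

    def add(self, delta):
        self.quantity += delta


class ReserveAccumulator:
    """Assumed behavior: integer quantity; the fraction is held in reserve."""
    def __init__(self, quantity):
        self.quantity = int(quantity)
        self.reserve = 0.0

    def add(self, delta):
        total = delta + self.reserve
        whole = math.floor(total)
        self.quantity += whole
        self.reserve = total - whole


class MonteCarloAccumulator:
    """Assumed behavior: the fraction becomes one molecule with that probability."""
    def __init__(self, quantity, rng=None):
        self.quantity = int(quantity)
        self.rng = rng or random.Random()

    def add(self, delta):
        whole = math.floor(delta)
        fraction = delta - whole
        self.quantity += whole + (1 if self.rng.random() < fraction else 0)


# Apply ten increments of 0.25 molecules to each accumulator.
simple = SimpleAccumulator(0)
reserve = ReserveAccumulator(0)
monte_carlo = MonteCarloAccumulator(0, random.Random(0))
for _ in range(10):
    simple.add(0.25)
    reserve.add(0.25)
    monte_carlo.add(0.25)
```

Under these assumptions the simple accumulator ends at 2.5, the reserve accumulator at 2 whole molecules with 0.5 held back, and the Monte Carlo accumulator at an integer that varies from run to run.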


3.3 
Features of E-Cell 2.0 


Recently, Mitsui Knowledge Industry has released the Windows version of E-Cell (ver. 2.0). E-Cell 2.0 is very similar to E-Cell 1.0, except that the virtual memory function (which allows the tracer to show concentrations over a long period) is not yet implemented. However, the E-Cell 2.0 data file contains time, mean value, maximum value, and minimum value, as against the time and value output of version 1.0.

To run this version, the following sup- 
porting software is required: 


• m4.exe distributed in Cygwin (http://sources.redhat.com/cygwin)

• Java runtime environment (http://java.sun.com/products/jdk/1.2/)

• Perl (http://www.perl.com/pub/a/language/info/software.html)

• C++ (http://www.borland.com/bcppbuilder/freecompiler/).


3.4 
Features of E-Cell 3.0 


E-Cell 3.0 is currently being developed with the aim of providing the cell simulation community with a generic and high-performance software environment. It is also Linux-based, and has a geometry information interface. It will integrate any set of different simulation algorithms, including the Variable-Process model, differential equation-based, diffusion-reaction, and particle dynamics-based approaches. One of the main highlights of the software is the integration of subsystems with different timescales. E-Cell 3.0 allows many components, driven by different simulation algorithms and different timescales, to coexist in the simulation by employing a discrete-event worldview as its fundamental formalism.
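
The idea of letting components with different timescales coexist under a discrete-event scheduler can be sketched in a few lines of Python. This is not the E-Cell 3.0 scheduler: the function, the stepper names, and the fixed step intervals are assumptions, used only to show how a priority queue orders the next event of each component.

```python
import heapq


def run(steppers, t_end):
    """Fire each stepper at multiples of its own interval, in global time order.

    `steppers` maps a (hypothetical) stepper name to its step interval.
    Returns the ordered event log as (time, name) pairs.
    """
    queue = [(interval, name) for name, interval in steppers.items()]
    heapq.heapify(queue)
    log = []
    while queue:
        t, name = heapq.heappop(queue)   # earliest pending event
        if t > t_end:
            continue                     # past the horizon: drop, do not reschedule
        log.append((t, name))            # a real stepper would update its model here
        heapq.heappush(queue, (t + steppers[name], name))
    return log


# A fast "metabolism" stepper and a slow "gene_expression" stepper.
log = run({"metabolism": 0.25, "gene_expression": 1.0}, t_end=1.0)
```

In this example the fast stepper fires four times for every single firing of the slow one, yet both advance along one shared time axis.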

The core simulation software of E-Cell 3.0 is a set of extension modules for the Python language interpreter, written in C++, C, and Python. This consists of a libecs cell modeling tool kit, an E-Cell microcore (EMC) layer, a Python language binding (PyECS), and other peripheral Python modules. Libecs (code name: Koyurugi) is a generic object-oriented C++ class library for constructing various object-based cell models. One of the special features of Koyurugi is that the constructed cell models themselves work as simulation engines. The E-Cell Micro Core defines interfaces and implementations of the Simulator class, which provides a simple application programming interface (API) of the Koyurugi class library. PyECS is basically a Python binding of the EMC. Libecs, EMC, and a main portion of PyECS are written in the C++ language.


3.5 
Advantages of the E-Cell System 


The E-Cell system has four main advantages:


1. The E-Cell architecture allows users easily to add components to the E-Cell simulation software in order to address individual modeling needs.

2. The E-Cell can accommodate many different types of simulation rather than follow one specific methodology; that is, it can simulate deterministic or stochastic models, either alone or together. Thus, users are able to model biological systems according to their characteristics, and incorporate diverse methods in the same model. In the first version of E-Cell, this is enabled by Reactors. Reactors are coded in C++, thereby offering users wide latitude for simulating a large variety of reactions. In E-Cell 3.0, it will be possible to create new types of Reactors, Substances, and Systems (and Steppers), thus allowing more flexibility. For example, users will be able to define an integration method.

3. The E-Cell can be customized, even by people with little or no programming knowledge.

4. The E-Cell offers efficient data management through Rule files. This is particularly useful when large amounts of data are available.


3.6 
Limitations of the E-Cell System 


A primary limitation of E-Cell 1.0 and 2.0 is that, at present, they lack sophisticated models of concentration gradients, simulated three-dimensional (3-D) structures, and molecular dynamics. However, this constraint has been overcome in E-Cell 3.0. A second limitation is that ODEs and algebraic equations can only be calculated explicitly. However, in most cases users can make adjustments to components to incorporate other calculation methods (such as incorporating a library within the reactor that allows the implicit calculation of ODEs).
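
Why the restriction to explicit calculation can matter is easiest to see on a stiff test equation. The Python sketch below compares explicit and implicit Euler integration of dy/dt = -k·y; it is a generic numerical illustration with assumed values of k and the step size, not code from E-Cell or from any library used with it.

```python
def explicit_euler(y0, k, dt, steps):
    """y_{n+1} = y_n + dt * (-k * y_n); unstable when k * dt > 2."""
    y = y0
    for _ in range(steps):
        y = y + dt * (-k * y)
    return y


def implicit_euler(y0, k, dt, steps):
    """y_{n+1} = y_n + dt * (-k * y_{n+1}), solved for y_{n+1}; always stable."""
    y = y0
    for _ in range(steps):
        y = y / (1.0 + k * dt)
    return y


# Assumed stiff case: fast decay (k = 50) integrated with a coarse step (dt = 0.1).
y_explicit = explicit_euler(1.0, 50.0, 0.1, 10)   # oscillates and grows in magnitude
y_implicit = implicit_euler(1.0, 50.0, 0.1, 10)   # decays toward zero, as the true solution does
```

With the same step size, the explicit scheme diverges while the implicit scheme remains stable, which is the practical motivation for plugging an implicit solver into a reactor.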


3.7 
E-Cell with 127 Genes 


The E-Cell with 127 genes is a hypothetical cell that contains the minimum gene set for survival (Fig. 2). For this, the genomic construction from M. genitalium was borrowed to build a first virtual cell to conduct what was termed “minimum cellular metabolism.” This model takes up glucose from the culture medium using a phosphotransferase system, generates ATP by catabolizing glucose to lactate through glycolysis and fermentation, and then exports lactate out of the cell. The enzymes and substrates are synthesized spontaneously and degraded over time to sustain “life.” Protein synthesis is implemented by modeling the molecules necessary for transcription and translation, namely RNA polymerase, ribosomal subunits, rRNAs, tRNAs, and tRNA ligases. The cell also takes up glycerol and fatty acids, and produces phosphatidyl glycerol for membrane structure, using a phospholipid biosynthesis pathway. The model cell is “self-supporting,” but not capable of proliferating; the cell does not have pathways for DNA replication or the cell cycle. The Mycoplasma gene set used comprised genes involved in glycolysis (n = 9), lactate fermentation (n = 1), phospholipid biosynthesis (n = 4), phosphotransferase system (n = 2), glycerol uptake (n = 1), RNA polymerase (n = 6), amino acid metabolism (n = 2), ribosomal L-subunit (n = 30), ribosomal S-subunit (n = 19), rRNA (n = 2), tRNA (n = 20), tRNA ligase (n = 19), initiation factor (n = 4), and elongation factor (n = 1). Overall, this resulted in 98 protein-coding genes and 22 RNA-coding genes. The remaining seven genes were imported from other sources. All of this was spread out into 495 reaction rules that modeled the enzymatic reactions responsible for increasing/decreasing substrate/product quantities, multisubstrate complex formation, transport of substances, and stochastic processes, for example, a transcription factor binding to a specific site of the chromosome.

Fig. 2 E-Cell with 127 genes (labeled regions include glycolysis, lipid biosynthesis, and the phospholipid bilayer).
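
The gene counts quoted above can be checked with a few lines of Python. The dictionary simply restates the numbers in the text; treating rRNA and tRNA as the RNA-coding categories is the only interpretation added here.

```python
# Gene counts per functional category, as listed in the text.
mycoplasma_genes = {
    "glycolysis": 9,
    "lactate fermentation": 1,
    "phospholipid biosynthesis": 4,
    "phosphotransferase system": 2,
    "glycerol uptake": 1,
    "RNA polymerase": 6,
    "amino acid metabolism": 2,
    "ribosomal L-subunit": 30,
    "ribosomal S-subunit": 19,
    "rRNA": 2,
    "tRNA": 20,
    "tRNA ligase": 19,
    "initiation factor": 4,
    "elongation factor": 1,
}
rna_categories = {"rRNA", "tRNA"}   # assumed to be the RNA-coding genes
imported_genes = 7                  # genes taken from other sources

listed_total = sum(mycoplasma_genes.values())
rna_coding = sum(n for name, n in mycoplasma_genes.items() if name in rna_categories)
protein_coding = listed_total - rna_coding
model_total = listed_total + imported_genes
```

The listed categories sum to 120 genes (98 protein-coding plus 22 RNA-coding), and adding the seven imported genes gives the 127 of the model.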

Mycoplasma genitalium was chosen for constructing a virtual cell with the minimum number of genes for survival, because it has the smallest known genome. Its genomic sequence (580 kb) was determined in 1995. The gene set of M. genitalium was abstracted to accommodate only those genes required for the bare, essential cellular metabolism. At the time of developing the first version of E-Cell, 120 genes from M. genitalium were identified and well documented. However, in order to successfully hand-construct a self-sustaining cell, this number fell short by seven. This shortage was made up by bringing in genes from “external sources”: four for phospholipid biosynthesis, one gene each for nucleoside phosphate kinase and nucleoside diphosphate kinase, and one for tRNA ligase. At that time, the phospholipid biosynthesis and a few other pathways in M. genitalium were not well characterized. The information on the kinetic properties of genes and proteins was mostly obtained from the Kyoto Encyclopedia of Genes and Genomes (KEGG) and BioCyc (previously called EcoCyc) databases.


3.8 
Applications of the E-Cell System 


The E-Cell with 127 genes predicted for the first time a sudden sharp increase in the ATP level in a glucose-starved cell, followed by an equally sharp decrease of ATP. Although the second event was expected, the first event was a major surprise and was later confirmed experimentally. This anomalous situation was explained by the fact that, although ATP production was stalled by cutting off the glucose supply, it took the cell a short time to consume the intermediates for ATP production, and this resulted in a sudden increase of ATP. This example demonstrates the potential of in-silico modeling for generating new information. Furthermore, recent trends point to a shift in emphasis from experimental biology toward in-silico biology.
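
The qualitative explanation of the ATP transient can be reproduced with a deliberately crude two-variable model. The Python sketch below is not the 127-gene E-Cell model: the reaction scheme (an uptake step that spends one ATP, a downstream step that returns two ATP per intermediate, and first-order ATP consumption), all rate constants, and the Euler step are assumptions chosen only to show the mechanism.

```python
def simulate_starvation(v_in=1.0, k2=1.0, k3=1.0, dt=0.001, t_end=10.0):
    """Return the ATP time course after glucose is removed at t = 0.

    Toy scheme (assumed):
      uptake:      glucose + ATP -> intermediate      (rate v_in while fed)
      downstream:  intermediate -> lactate + 2 ATP    (rate k2 * intermediate)
      maintenance: ATP -> ADP                         (rate k3 * ATP)
    """
    # Start from the fed steady state: intermediate = v_in/k2, ATP = v_in/k3.
    intermediate = v_in / k2
    atp = v_in / k3
    trace = [atp]
    for _ in range(int(round(t_end / dt))):
        # Glucose is gone, so the ATP-consuming uptake step no longer runs,
        # while stored intermediates still flow through the ATP-yielding step.
        d_intermediate = -k2 * intermediate
        d_atp = 2.0 * k2 * intermediate - k3 * atp
        intermediate += d_intermediate * dt
        atp += d_atp * dt
        trace.append(atp)
    return trace


atp_trace = simulate_starvation()
atp_start, atp_peak, atp_end = atp_trace[0], max(atp_trace), atp_trace[-1]
```

In this toy model ATP first rises above its fed steady-state level, because ATP spending on uptake stops before the intermediates are exhausted, and then falls toward zero once they are used up.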

Overall, the E-Cell has applications in 
the following areas: 


Metabolic requirements: The assessment of a cell’s metabolic requirements is an area that the E-Cell can successfully address. At present, M. genitalium is grown in a complex medium containing several chemically undefined components, including fetal bovine serum and also yeast and beef extracts. By combining knowledge of the metabolic enzymes present in a cell with information concerning protein transporters of metabolites across the cell membrane, it should be possible, by using the E-Cell model, to evaluate whether a particular defined medium can support growth.


Gene expression: E-Cell software can be 
used to decipher gene regulatory networks. 
The plan is to use M. genitalium to achieve 
this objective. 


Minimal gene set: The self-sustaining 
E-Cell will be further extended to define 
the minimal set of genes required for a 
self-replicating cell under a specific set of 
laboratory conditions. 


Clinical applications: Currently, investigations are being undertaken to determine the clinical applications of the E-Cell; examples include diabetes and enzyme deficiencies in erythrocytes (see Sect. 3.9).


3.9 
Simulation of Erythrocyte Enzyme 
Deficiencies 


Glucose-6-phosphate dehydrogenase (G6PD) is a key enzyme that produces NADPH in the pentose phosphate pathway (Fig. 3). Initially, G6PD converts glucose-6-phosphoric acid into 6-phosphoglucono-1,5-lactone (thus generating NADPH), which is then metabolized to ribulose-5-phosphoric acid via 6-phosphogluconic acid, generating additional NADPH in the process. Within the erythrocyte, a major function of glutathione (GSH) is to eliminate superoxide anions and organic hydroperoxides. Peroxides are eliminated through the action of glutathione peroxidase, yielding oxidized glutathione (GSSG).
Fig. 3 The whole erythrocyte model.

A G6PD deficiency has been implemented in the E-Cell model, and the kinetic parameters have been modified in accordance with the biochemical environment of mutant cells taken from patients reporting this deficiency. The simulation experiments were carried out with steady-state concentrations corresponding to those of the normal erythrocyte. Sequential changes in the quantity of NADPH, GSH, and ATP were observed in the simulation experiments. However, the longevity of the computer model, as estimated by the concentration of ATP, was found to be much shorter than that of the “real” erythrocyte with G6PD deficiency. This difference was, presumably, due to a lack of pathways producing GSH, and of the export system for GSSG. After these pathways had been added to the model, the longevity of the cell and the GSH/GSSG ratio were found to have increased. These results indicate that these pathways partially compensate for the reduction of GSH and have a role in easing anemia, a condition which is related to G6PD deficiency. This result can also provide a good explanation for the fact that G6PD deficiency is one of the most common causes of anemia. If the deficiency, given these compensation pathways, had no severe disadvantage for survival, then the condition would spread through the population. When the activity of G6PD is decreased, the activity of 6-phosphogluconate dehydrogenase is increased, thereby compensating for the reduced production of NADPH. However, because either 6-phosphoglucono-1,5-lactone was not supplied, or because there was a deficiency of G6P, the 6-phosphogluconic acid supply was rapidly exhausted and the production of NADPH stopped. Consequently, the amount of NADPH began to fall gradually and soon became exhausted. The level of GSH then began to decrease due to its conversion into GSSG. Finally, the metabolic performance of the cells worsened when the ATP became exhausted due to an inhibition of the rate-determining enzymes (due to low GSH/GSSG). This model of G6PD deficiency correlates well with the clinical situation, and may serve as a “test bed” for extending the model to other human erythrocyte metabolic disorders (Fig. 4).

Fig. 4 Simulation of erythrocyte deficiency.

Recently, an erythrocyte model with a pyruvate kinase (PK) deficiency has also been reconstructed. On creating an in-silico model for PK deficiency, the ATP production rate was found to decrease proportionately, leading to an eventual elimination from the system. Simultaneously, an increase in the concentration of 2-phosphoglycerate, 3-phosphoglycerate and phosphoenolpyruvate was observed, which is in agreement with the clinical presentation of the PK phenotype.


4 
Practical Applications 


The E-Cell offers possibilities of creating new opportunities for drug target selection, based on predictive models. For example, pathway-based disease models can assist at the preclinical stage to identify any potential toxic effects of “lead compounds.” If a compound targets a network hub, the possibility that such a drug would give rise to adverse side effects is quite high. However, if drug targets are found to be either (1) non-hubs, terminal nodes, or linkers in the network, or (2) multiple weak binders that collectively bring about the effect, then such compounds will be preferred as candidate drugs. Today, many companies employ disease- and population-based drug response models to lower their R&D costs. A prior assessment of side effects and toxic effects can speed up the drug discovery process, leading to significant savings.
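
The distinction between hubs, linkers, and terminal nodes can be made concrete with a small degree-based classification in Python. The network, the node names, and the degree threshold for calling a node a hub are all hypothetical; real analyses would use a curated interaction network and a justified cutoff.

```python
from collections import defaultdict


def degrees(edges):
    """Count the number of interaction partners of each node."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)


def classify(edges, hub_threshold=3):
    """Label nodes as 'hub', 'linker', or 'terminal' from their degree.

    The threshold is an arbitrary illustrative choice.
    """
    labels = {}
    for node, d in degrees(edges).items():
        if d >= hub_threshold:
            labels[node] = "hub"
        elif d == 1:
            labels[node] = "terminal"
        else:
            labels[node] = "linker"
    return labels


# Hypothetical interaction network: A touches three partners, D bridges A and E.
network = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")]
labels = classify(network)
```

Here node A would be flagged as a hub (and hence a riskier target under the reasoning above), D as a linker, and B, C, and E as terminal nodes.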

By producing detailed “route maps” of the molecular circuitry of a cell, it is possible, at least in theory, to develop smarter therapeutic strategies. However, the success of this strategy depends on the completeness and accuracy of the relevant data acquired. Previously, Systems Biology has played a key role in providing an understanding of the drug gefitinib (Iressa; AstraZeneca), of drugs to treat liver abnormalities (Pfizer), and of kinase inhibitor mechanisms (Johnson & Johnson) [1]. In general, the systems approach has resulted in better descriptions of many biological systems, leading to the possibilities of systems design and engineering and the promotion of a new discipline termed Synthetic Biology. In Synthetic Biology, attention is focused on the ground-up engineering of novel systems for useful applications. However, in order for Synthetic Biology to become successful, it is first important that Systems Biology approaches continue to generate new data, and to provide new models and new descriptions of how biological components collaborate to generate distinct phenotypes.


5 
Concluding Remarks 


In order to understand the whole, it is important first to study the whole. Given the enormous complexity and data generated by the genome, proteome, transcriptome and metabolome, computer simulations are clearly indispensable for future biological research. Whether or not it is feasible to construct a computer model of a whole living cell remains an open question. Although attempts at whole-cell modeling were not made until the late 1990s, the importance of computer simulations of cellular metabolism has in fact been realized since the 1980s [2], with various cellular processes such as gene expression [3, 4], cell cycles [5, 6] and metabolic pathways [7-11] having been modeled and simulated independently. In order to understand the crosstalk among these seemingly “self-regulating systems,” it is necessary to construct an integrated model of the cell. However, one of the major problems when constructing large-scale models is a lack of quantitative data, since most of the biological knowledge currently available is of a qualitative nature (in the form of pathway maps). Unfortunately, the quantitative data available are often noisy and not well suited to simulations [12]. Thus, a major challenge in this respect is to collect large amounts of very accurate (and preferably time series) data, to construct quantitative models, and to “train” the models with additional results acquired from the laboratory, until the simulation matches “real-life” biology [13, 14]. The problem is that, in order to achieve this objective, it is not only good data that are needed but also a novel computational and software engineering approach. This is, indeed, the “new biology of the twenty-first century.”
Keywords


Computational synthetic biology:
Describing and predicting the function of synthetic biological systems using computer simulations.

Synthetic biology tools:
Computational and experimental tools that enable the development of synthetic biological systems.

Automatic design of synthetic systems:
The automatic production of mathematical models of synthetic biological systems by inserting only those components of which the system consists.

Modeling synthetic constructs:
The development of mathematical representations of synthetic constructs controlling cellular behavior.

Modeling genetic circuits:
The development of mathematical representations of genetic circuits controlling cellular behavior.


The details of SynBioSS, a computational design tool for synthetic biology, are presented. In SynBioSS, the user simply inputs the molecular components of a synthetic biological system and the software outputs a reaction network that models the biomolecular interactions of this system. This software assists practitioners with little or no modeling experience in quantitatively describing synthetic biological systems.


1 
Introduction 


During the past decade, increasingly complex synthetic biological systems have been
constructed [1-3] that are largely based on the first, prototypical, synthetic systems.
Examples of such earlier systems include the bistable switch [4], repressilator [5],
various genetic switches and sophisticated circuits [6-12], and bio-logical gates of
varying complexity [13-16]. More recently, a trend has emerged towards synthetic
constructs whose function chiefly relies on the synergy between cells [17-20]. Notable
tasks performed by these multicellular synthetic systems involve the synchronization
of cells [21] and the rescuing or killing of cells [22-24], among other functions based
on cell-to-cell communication [17-20].

Useful applications for synthetic biological constructs abound. Thus far, synthetic
biology has contributed towards preventing and healing infections, advancing cancer
treatments, designing sophisticated vaccines and improving cell therapy and
regenerative medicine [25, 26]. In addition, synthetic biology has facilitated the
production of alternative fuels, drugs, and other biomaterials [27-31].

A central challenge in the field of syn- 
thetic biology is to understand the complex 
and, sometimes, nonintuitive behavior of 
a synthetic construct. This challenge is 
to an extent related to the multicompo- 
nent, nonlinear network architectures of 
synthetic biological systems [32-35]. To 
overcome this challenge, a considerable 
fraction of synthetic biology studies have 
paired mathematical modeling with exper- 
imentation in order to design, characterize 
and test synthetic systems and, subse- 
quently, to predict their behavior under 
different conditions [36-39]. 

The approaches implemented to model and simulate the behavior of synthetic
biological systems vary significantly, and include, but are not limited to,
deterministic, stochastic, discrete, continuous, and hybrid methods that are
combinations of the above [40-44]. These quantitative methods, coupled with
experimental efforts, have fostered the design, tuning and improvement of
numerous synthetic biological systems.

With the field of synthetic biology continuously expanding, there is a strong
impetus for sophisticated and practical computational tools which allow for the
quick, accurate, and inexpensive design of synthetic biological systems. To this end,
a number of software packages have been developed, and have contributed to the
design and testing of synthetic biological constructs [45-47]. In this chapter,
numerous available computational tools for modeling biological systems are briefly
described. A computational package, SynBioSS Designer, which facilitates the
automatic generation of biochemical reaction networks that describe the
biomolecular events engaged in the synthetic systems, is then discussed [48].

One broadly used computational tool for modeling synthetic systems is COPASI.
This tool allows the user to perform a number of analyses such as transient and
steady-state simulation, sensitivity analysis, and metabolic control analysis [49].
Another commonly used software package for the simulation of biological systems is
CellDesigner, which includes multiple features such as the graphical representation
of the constructs using the Systems Biology Graphical Notation (SBGN), the Systems
Biology Markup Language (SBML), and different simulation and analysis packages [50].
Other notable software packages that have advanced the design of synthetic
biological systems include, but are not limited to, SynBioSS [51, 52], BioJADE [53],
TinkerCell [54], Asmparts [55], ProMoT [56], GenoCAD [57], and CADLIVE [58].

SynBioSS is a software suite composed of three distinct components: SynBioSS
Desktop Simulator (SynBioSS DS); SynBioSS Wiki; and SynBioSS Designer [48, 51, 52].
SynBioSS DS is a simulation tool supported by a user-friendly interface. Here, the
user inputs a set of biochemical reactions that capture the biomolecular interactions
of the system and a few other variables such as the cell volume and cell division
time. SynBioSS DS then implements a hybrid, stochastic-discrete and
stochastic-continuous algorithm to simulate the evolution of the biochemical reaction
network in time. SynBioSS DS is capable of simulating small reaction networks on an
ordinary desktop computer, without the need for any particular software. Another
version of SynBioSS, the Hybrid Stochastic Simulation for Supercomputers (Hy3S)
algorithm, which runs in MATLAB, has also been developed [59]. It should be
stressed that the computational cost of simulating reaction networks scales with
the size of the network, thereby necessitating the use of supercomputers when the
simulated reaction network is large. Both SynBioSS DS and Hy3S have been used
successfully in several studies to stochastically simulate the behavior of not only
synthetic but also naturally occurring biological systems [6, 9, 13, 60, 61].
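To give a feel for what stochastic-discrete simulation of a reaction network involves, the following minimal sketch applies the plain Gillespie direct method to a single reversible dimerization, 2 TetR <-> TetR2. It is not the hybrid algorithm used by SynBioSS DS or Hy3S. The function name, the stochastic rate constants and the initial molecule counts are assumptions chosen only for illustration.

```python
import math
import random


def gillespie_dimerization(n_monomer, n_dimer, c_fwd, c_rev, t_end, rng):
    """Gillespie direct method for 2 TetR <-> TetR2.

    c_fwd and c_rev are stochastic rate constants (1/s per reactant
    combination). The values passed below are illustrative assumptions,
    not SynBioSS defaults.
    """
    t = 0.0
    while True:
        # Propensities: distinct monomer pairs, and individual dimers.
        a_fwd = c_fwd * n_monomer * (n_monomer - 1) / 2.0
        a_rev = c_rev * n_dimer
        a_total = a_fwd + a_rev
        if a_total == 0.0:
            break  # nothing can react any more
        # Exponentially distributed waiting time to the next event.
        t += -math.log(1.0 - rng.random()) / a_total
        if t > t_end:
            break
        # Pick which reaction fires, in proportion to its propensity.
        if rng.random() * a_total < a_fwd:
            n_monomer -= 2
            n_dimer += 1
        else:
            n_monomer += 2
            n_dimer -= 1
    return n_monomer, n_dimer


rng = random.Random(0)  # fixed seed so a run can be repeated
monomers, dimers = gillespie_dimerization(100, 0, 0.01, 0.1, 50.0, rng)
print(monomers, dimers)
```

Every event requires fresh propensities and a new random waiting time, which is why the cost of such simulations grows with the size and activity of the network.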

The second component of SynBioSS, SynBioSS Wiki, consists fundamentally
of two items: (i) a web interface which relies upon the MediaWiki package; and
(ii) a database where the molecular components, along with their corresponding
kinetic parameters, are stored. The primary purpose of SynBioSS Wiki is to
serve as an information toolbox. Practitioners can store information on
biological constructs of interest, and retrieve this information and apply it to
the design of biological systems. More details regarding SynBioSS Wiki can be
found at: https://www.synbioss.org/wiki/index.php5/Main_Page.


USER INPUT
Specification of:
1. Biological parts, e.g. components of BBa_T9002:
   p(tetR) luxR: R0040 B0034 C0062 B0010 B0012
   lux pR GFP: R0062 B0032 E0040 B0010 B0012
2. Proteins
   a) Transcribed (LuxR, GFP)
   b) Constitutive (TetR)
3. Effectors, e.g. aTc, HSL

DESIGNER ALGORITHMS
1. Detection of “transcriptional units” and generation of associated reactions:
   basal transcription, translation
2. Protein reactions: multimerization, binding to DNA and/or effectors
3. Generation of additional reactions: activation, leakiness, degradation, transport

OUTPUT
Reaction network in SBML or NetCDF format

Fig. 1 Steps of SynBioSS Designer [48]. “Transcribed” proteins are produced by the
synthetic biological system itself. Their production is modeled explicitly in SynBioSS.
On the other hand, “constitutive” proteins are proteins that exist naturally in the
cell, interact with and regulate the synthetic device, but are not produced by it.
The concentration of “constitutive” proteins, such as RNApol, is considered constant
in SynBioSS models, unless otherwise indicated.

The third part of SynBioSS, SynBioSS Designer, is a computational tool which
automatically generates networks of biochemical reactions that model synthetic (or
natural) biological systems. Similar to the SynBioSS DS, SynBioSS Designer starts
with a user-friendly interface whereby the user enters the molecular components of
interest. These components may be single inducer molecules, repressor and activator
proteins, or promoters and coding regions of DNA [48].

One of the most salient features of SynBioSS Designer is its capacity to
implement BioBricks. These are synthetic DNA sequences whose function and structure
are well determined. There are three different levels of BioBricks which include
constructs of different complexity: parts, devices, and systems [62, 63]. BioBricks are
well aggregated in the Registry of Standard Biological Parts, a depository of synthetic
biological parts [64]. Designer is linked to this depository and, through its tabbed
interface, endows the user with the ability to visualize and handle entire sequences
of BioBricks. More specifically, the user can readily pick the desired BioBricks and
alter, add or delete different properties; Designer then automatically generates the
reaction cascades associated with the functionality of these particular BioBricks [48].
An illustration of the information that the user inserts in the Designer, the
implementation of the information by the algorithm and its output is shown in Fig. 1.
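The first algorithmic step named in Fig. 1, the detection of transcriptional units, can be pictured with a short sketch. This is not Designer's actual implementation. It assumes that a unit runs from a promoter to the last terminator preceding the next promoter. The part list is the BBa_T9002 example as read from Fig. 1, and the type labels and the function name are hypothetical annotations added for illustration.

```python
# Ordered parts of BBa_T9002 as read from Fig. 1. The type labels are
# annotations assumed for this sketch, not Designer's internal schema.
T9002_PARTS = [
    ("R0040", "promoter"), ("B0034", "rbs"), ("C0062", "coding"),
    ("B0010", "terminator"), ("B0012", "terminator"),
    ("R0062", "promoter"), ("B0032", "rbs"), ("E0040", "coding"),
    ("B0010", "terminator"), ("B0012", "terminator"),
]


def split_transcriptional_units(parts):
    """Group an ordered list of (name, type) parts into transcriptional units.

    Assumption: a new unit begins at any promoter that directly follows
    a terminator.
    """
    units, current = [], []
    for name, kind in parts:
        starts_new_unit = (
            kind == "promoter" and current and current[-1][1] == "terminator"
        )
        if starts_new_unit:
            units.append(current)
            current = []
        current.append((name, kind))
    if current:
        units.append(current)
    return units


for unit in split_transcriptional_units(T9002_PARTS):
    print(" ".join(name for name, _ in unit))
```

For this part list the sketch yields two units, one producing LuxR and one producing GFP, which matches the two units drawn in Fig. 1.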

In the following sections, two represen- 
tative examples are presented of how to 
use Designer to generate reaction net- 
works for modeling synthetic biological 
constructs. The first example illustrates 
the input of a simple genetic device into 
Designer and highlights Designer’s con- 
nection to the Standard Biological Parts 
Web Service. The second example involves 
a larger, more complex device that is en- 
tirely user-defined. 


2 
Modeling Synthetic Systems Made by 
BioBricks 


2.1 
Modeling BioBricks 


BioBrick BBa_I7100 is a TetR-repressible 
green fluorescent protein (GFP) genera- 
tor, as described in the Registry of Stan- 
dard Biological Parts [64]. As shown in 
Fig. 2, this device is composed of a sin- 
gle transcriptional unit which includes a 
TetR-repressible promoter (R0040), a ribo- 
some binding site (RBS) (B0030), a coding 
region for GFP (E0040), and two termi- 
nators (B0010 and B0012). When TetR is 
present in this system, the expression of 
GFP is repressed; conversely, when the 
inducer aTc is added, it causes a confor- 
mational change in TetR, preventing its 
binding to the promoter and consequently 
inducing the production of GFP. 

To fully specify any device in Designer, the user must input: (i) the series of
parts; (ii) the proteins in the system; and (iii) any effector molecules in the
system. When the device is based on BioBricks (as in this example), several of
these details can be automatically imported into Designer. The process of user
input for BBa_I7100 is detailed below.

p(tetR) GFP
R0040 B0030 E0040 B0010 B0012

Fig. 2 Sequence of parts for BioBrick BBa_I7100.


2.2 
Inserting and Modifying BioBricks 


The first page of Designer’s interface allows users to add BioBricks via three
different methods. If the user knows the exact BioBrick they wish to add,
the BioBrick ID can be entered into the “BioBrick Search” field. Designer
is capable of importing individual parts, such as BBa_R0040, as well as composite
BioBricks, such as BBa_I7100. Therefore, for large devices there is no need to enter
individual bricks one-by-one. The second method for adding parts to Designer is to
create custom “BioBricks”; this method is explored in more detail in the second
example presented herein (see Sect. 3). The third method is a coding DNA
search, where the user may enter the name of a protein into the corresponding
search field, and Designer will then return a list of coding DNA BioBricks likely
corresponding to that protein.

Continuing with the BBa_I7100 example, it is necessary only to search for
its ID (see Fig. 3), and Designer will then retrieve and display all of its individual
parts (see Fig. 4). The user can click on the tab for any part to access its properties
and edit them as desired. Any changes to the default values pulled from the Parts
Registry are highlighted with red text. Individual default values can be restored
by clicking the gray, underlined “rewind” symbol directly to the right of the value.
Clicking the “Reset Brick” button at the bottom of the tab resets all properties of
the brick to their default values. The other buttons at the bottom of the tab can be
used to reorder parts (i.e., “Shift Brick Right” or “Shift Brick Left”), or remove
them from the construct entirely (i.e., “Delete Brick”).

Fig. 3 Input of BioBrick ID via the SynBioSS Designer web interface.

Fig. 4 Users can edit any part. Specification of coding DNA shown here.

Although Designer imports as much information as possible from the Parts
Registry, it is always necessary to input additional information for coding DNA,
and sometimes necessary for promoters. In Designer, promoters are considered
to be constitutively “ON” by default, but can be switched to “OFF.” For the
present BBa_I7100 example, the default “ON” setting is appropriate. However,
the imported names of the R0040 TetR binding sites, “TetR_1” and “TetR_2,” are
similar to the protein dimer “TetR2” in this system and thus the operator sites
are renamed as “tetO1” and “tetO2.” As for coding DNA, the protein type (e.g.,
activator, repressor) must be specified by the user. Clicking on E0040, the protein
type is set to “Reporter” as shown in Fig. 4. RBS and terminator BioBricks also
have their own tabs, but checking them is unnecessary as Designer assigns these
bricks no properties.

Once the properties of all parts have been specified, the user can click “Continue to
2/3” at the very bottom of the web interface (below the tabbed display) to proceed.


2.3 
Protein Input and Specifics 


After the sequence of BioBricks and corresponding details have been determined,
the next step is to add additional proteins to the system, if any, and to specify
the binding behavior of all proteins. As GFP is produced by the device itself, the
only additional protein present in this example system is the constitutively
expressed TetR. This protein can be added by entering its name in the “Input
Proteins” subsection and selecting its type (“Repressor” in this case).

The next field, “Complex Specifics,” is mandatory. Regulatory proteins must,
in principle, form complexes before they can bind to DNA, and the number of
subunits in these complexes must be specified in Designer for each protein.
In the example system, TetR dimerizes to form TetR2. To describe this behavior,
“TetR” is first selected from the “Complex Specifics” dropdown menu; “2” is then
entered as the “Number of Subunits in Complex” and “Add Complex Info” is clicked.
Additionally, for GFP “1” must be entered to explicitly indicate that it does not form
any complexes of interest.
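The link between the “Number of Subunits in Complex” value and the multimerization reactions listed in Tables 1 and 3 can be sketched as follows. The naming pattern (TetR2, LacI2, LacI4) and the stepwise pairing are inferred from those tables. The function itself is a hypothetical illustration and not Designer's code, and it only handles subunit counts that are powers of two.

```python
def multimerization_reactions(protein, n_subunits):
    """List reversible pairing reactions that build an n-subunit complex.

    Assumption for this sketch: complexes assemble by successive
    doubling (monomer -> dimer -> tetramer), as in Tables 1 and 3.
    """
    if n_subunits < 1 or n_subunits & (n_subunits - 1):
        raise ValueError("this sketch only handles 1, 2, 4, 8, ... subunits")
    reactions = []
    size = 1
    while size < n_subunits:
        small = protein if size == 1 else f"{protein}{size}"
        large = f"{protein}{2 * size}"
        reactions.append(f"2 {small} -> {large}")   # association
        reactions.append(f"{large} -> 2 {small}")   # dissociation
        size *= 2
    return reactions


print(multimerization_reactions("TetR", 2))
print(multimerization_reactions("LacI", 4))
print(multimerization_reactions("GFP_protein", 1))  # no complex, no reactions
```

Entering “1” for GFP therefore corresponds to an empty reaction list, which is why Table 1 contains no multimerization entries for the reporter.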

Protein interactions with promoter operator sites must be specified in the
“Binding Specifics” subsection. This is achieved by selecting matching pairs of
proteins and their binding sites in the two dropdown boxes and clicking “Add
Binding Info.” In this example system, TetR binds with both “tetO1” and “tetO2”
on promoter R0040. Figure 5 shows how this information is added. Once specified,
the protein-DNA binding information is added to the “Binds” column for the
corresponding protein in the “Current Proteins” table.

When the “Current Proteins” table is complete, the system is ready for the
addition of effector information, which can be specified after clicking “Continue
to 3/3.”


2.4 
Effector Input and Specifics 


Effector molecules, if any, are added individually via the “Input Effectors”
subsection. It must be borne in mind that the sole effector in this example system
is aTc. Once added, effectors appear in a “Current Effectors” table. As prompted
by this table, users must also specify the number of times that an effector can bind
to a protein complex. In this example, aTc binds to TetR dimers a maximum of two
times to form a TetR2:aTc2 complex. To introduce this information, select “aTc”
and “TetR” in the first set of dropdown menus in the “Effector Specifics” subsection,
and enter “2” as “Max Effectors per Complex.”

Fig. 5 Protein-DNA interactions are specified.

By default, Designer assumes that pro- 
tein complexes are capable of binding 
DNA without additional help from other 
molecules. Whilst this is true for TetR2 in 
this example system, certain proteins (e.g., 
LuxR) must be bound with effectors (e.g., 
homoserine lactone; HSL) before they can 
bind to DNA. This type of information can 
be added by selecting the protein and ef- 
fector pair in the second set of dropdown 
boxes in the “Effector Specifics” section 
and clicking “Act in Concert.” 

Now that all parts, proteins, and effectors have been added and fully specified,
Designer has all the necessary information to generate a reaction network describing
the system. The user can generate a NetCDF or SBML file as output. The reaction
network generated for this specific example is given in Table 1. The kinetic data
in this table are assigned based on a set of default values used by Designer.
Additional details on the general contents of Designer output files are provided in
Sect. 2.5.
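How a “Max Effectors per Complex” value expands into reactions can be sketched with the aTc and TetR2 example. The species names follow the pattern seen in the Induction block of Table 1 (TetR2:aTc, TetR2:aTc2). The function is a hypothetical illustration rather than Designer's implementation, and it covers only effector binding to the free complex, not the DNA-bound variants that Table 1 also lists.

```python
def effector_binding_reactions(complex_name, effector, max_effectors):
    """Enumerate stepwise, reversible effector binding to a free complex."""

    def state(k):
        # Name of the complex carrying k effector molecules.
        if k == 0:
            return complex_name
        if k == 1:
            return f"{complex_name}:{effector}"
        return f"{complex_name}:{effector}{k}"

    reactions = []
    for k in range(max_effectors):
        lower, higher = state(k), state(k + 1)
        reactions.append(f"{lower} + {effector} -> {higher}")  # binding
        reactions.append(f"{higher} -> {lower} + {effector}")  # release
    return reactions


for reaction in effector_binding_reactions("TetR2", "aTc", 2):
    print(reaction)
```

With a maximum of two effectors this produces four reactions, matching the first four entries under “Induction” in Table 1.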


2.5 
Designer Output 


Designer outputs a network of reactions describing all the aspects of gene
expression and regulation in the system. To run a simulation using these reactions,
initial and environmental conditions are also essential. Designer ascribes default
values for these conditions, as depicted in Table 2. Furthermore, the initial volume of
the system is 10^-15 L by default. Note that the default initial amount of proteins and
effectors (“Other” in Table 2) is set to zero, so the user must be certain to edit these
values as desired, for example, varying aTc and constitutively expressed TetR as in the
example in Sect. 2.1.

Example reaction networks are provided in Tables 1 and 3. All reactions have
elementary rate laws with kinetic constants in terms of moles, liters, and seconds.
Designer assigns reasonable default kinetic constants to each reaction; these
defaults are not automatically tailored to the specific system, however, so
system-specific values must be retrieved manually from SynBioSS Wiki or the relevant
literature. Designer outputs either a NetCDF or SBML file, which can then be loaded
into a simulation software package of the user’s choice, such as SynBioSS DS [48].
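Because the kinetic constants are expressed in moles, liters, and seconds while stochastic simulators count molecules, a unit conversion is involved. The sketch below shows the standard conversion for a bimolecular reaction between two different species, together with the molar concentration that corresponds to a molecule count. The volume of 1e-15 L and the function names are assumptions made for this example. The rate constant of 1e9 per molar per second is taken from the “Regulation” entries of Table 1.

```python
AVOGADRO = 6.02214076e23  # molecules per mole (exact SI value)


def molecules_to_molar(count, volume_l):
    """Molar concentration of `count` molecules in `volume_l` liters."""
    return count / (AVOGADRO * volume_l)


def bimolecular_stochastic_rate(k_molar, volume_l):
    """Per-molecule-pair rate (1/s) for A + B -> C, given k in 1/(M s).

    Valid for two different reactant species; identical reactants
    (2 A -> A2) need an extra combinatorial factor not handled here.
    """
    return k_molar / (AVOGADRO * volume_l)


VOLUME_L = 1e-15  # assumed cell volume for this illustration

# One molecule in this volume is on the order of a nanomolar concentration.
print(molecules_to_molar(1, VOLUME_L))
# A 1e9 1/(M s) association constant becomes roughly one event per second
# per molecule pair.
print(bimolecular_stochastic_rate(1e9, VOLUME_L))
```

First-order constants, such as the dissociation and degradation rates in Tables 1 and 3, already have units of 1/s and need no volume correction.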


3 
Modeling Synthetic Systems Made by 
User-Defined Genetic Constructs 


3.1 
Modeling User-Defined Genetic Constructs 


Consider now the case of a lac-tet-ara “repressilator” [65], a synthetic system
that was computationally designed and inspired by the original “repressilator”
[5]. The overall behavior of this device is that AraC represses the expression of
LacI, LacI represses the expression of TetR, and the latter represses AraC
expression, as shown in Fig. 6. This device, therefore, consists of three transcriptional
units: (i) an AraC-repressible promoter

Tab. 1 Reaction network of the BioBrick example.

Kinetic data      Reaction
Protein multimerization
1 000 000 000     2 TetR → TetR2
                  TetR2 → 2 TetR
Transcription
                  RNAp + BBa_R0040 + tetO2 + tetO1 → RNAp:BBa_R0040:tetO2:tetO1
                  RNAp:BBa_R0040:tetO2:tetO1 → RNAp + BBa_R0040 + tetO2 + tetO1
                  RNAp:BBa_R0040:tetO2:tetO1 → RNAp:BBa_R0040:tetO2:tetO1*
                  RNAp:BBa_R0040:tetO2:tetO1* → RNAp:DNA_GFP_protein + BBa_R0040 + tetO2 + tetO1
30 nt/s, 600 nt   RNAp:DNA_GFP_protein → RNAp + mRNA_GFP_protein
Translation
100 000           rib + mRNA_GFP_protein → rib:mRNA_GFP_protein
33                rib:mRNA_GFP_protein → rib:mRNA_GFP_protein_1 + mRNA_GFP_protein
33 aa/s, 220 aa   rib:mRNA_GFP_protein_1 → rib + GFP_protein
Regulation
1 000 000 000     TetR2 + tetO1 → TetR2:tetO1
0.005             TetR2:tetO1 → TetR2 + tetO1
1 000 000 000     TetR2 + tetO2 → TetR2:tetO2
0.005             TetR2:tetO2 → TetR2 + tetO2
Induction
50 000 000        TetR2 + aTc → TetR2:aTc
0.1               TetR2:aTc → TetR2 + aTc
50 000 000        TetR2:aTc + aTc → TetR2:aTc2
0.1               TetR2:aTc2 → TetR2:aTc + aTc
1 000 000 000     TetR2:aTc + tetO1 → TetR2:aTc:tetO1
0.7               TetR2:aTc:tetO1 → TetR2:aTc + tetO1
1 000 000         TetR2:tetO1 + aTc → TetR2:aTc:tetO1
0.4               TetR2:aTc:tetO1 → TetR2:tetO1 + aTc
1 000 000 000     TetR2:aTc2 + tetO1 → TetR2:aTc2:tetO1
0.7               TetR2:aTc2:tetO1 → TetR2:aTc2 + tetO1
50 000 000        TetR2:aTc:tetO1 + aTc → TetR2:aTc2:tetO1
0.1               TetR2:aTc2:tetO1 → TetR2:aTc:tetO1 + aTc
1 000 000 000     TetR2:aTc + tetO2 → TetR2:aTc:tetO2
0.7               TetR2:aTc:tetO2 → TetR2:aTc + tetO2
1 000 000         TetR2:tetO2 + aTc → TetR2:aTc:tetO2
0.4               TetR2:aTc:tetO2 → TetR2:tetO2 + aTc
1 000 000 000     TetR2:aTc2 + tetO2 → TetR2:aTc2:tetO2
0.7               TetR2:aTc2:tetO2 → TetR2:aTc2 + tetO2
50 000 000        TetR2:aTc:tetO2 + aTc → TetR2:aTc2:tetO2
0.1               TetR2:aTc2:tetO2 → TetR2:aTc:tetO2 + aTc
Nonspecific DNA interactions
1000              TetR2 + nsDNA → TetR2:nsDNA
1.62              TetR2:nsDNA → TetR2 + nsDNA
1000              TetR2:aTc + nsDNA → TetR2:aTc:nsDNA
1.62              TetR2:aTc:nsDNA → TetR2:aTc + nsDNA
1000              TetR2:nsDNA + aTc → TetR2:aTc:nsDNA
1.62              TetR2:aTc:nsDNA → TetR2:nsDNA + aTc
1000              TetR2:aTc2 + nsDNA → TetR2:aTc2:nsDNA
1.62              TetR2:aTc2:nsDNA → TetR2:aTc2 + nsDNA
1000              TetR2:aTc:nsDNA + aTc → TetR2:aTc2:nsDNA
1.62              TetR2:aTc2:nsDNA → TetR2:aTc:nsDNA + aTc
Leakiness
0.0166            RNAp + BBa_R0040 + TetR2:tetO1 + tetO2 → RNAp:BBa_R0040:tetO2:tetO1:TetR2
0.057             RNAp:BBa_R0040:tetO2:tetO1:TetR2 → RNAp + BBa_R0040 + TetR2:tetO1 + tetO2
0.1               RNAp:BBa_R0040:tetO2:tetO1:TetR2 → RNAp:BBa_R0040:tetO2:tetO1:TetR2*
30                RNAp:BBa_R0040:tetO2:tetO1:TetR2* → RNAp:DNA_GFP_protein + BBa_R0040 + TetR2:tetO1 + tetO2
Transport
1.00E-10          → TetR2
Degradation
0.000289          GFP_protein →
0.0015            mRNA_GFP_protein →
0.000289          TetR2 →
0.000193          TetR2:nsDNA → nsDNA
0.000289          TetR2:aTc → aTc
0.000193          TetR2:aTc:nsDNA → aTc + nsDNA
0.000289          TetR2:aTc2 → 2 aTc
0.000193          TetR2:aTc2:nsDNA → 2 aTc + nsDNA

*Reactions representing the RNA polymerase holoenzyme complex transitioning from a closed to
an open state; the asterisk indicates the open state.
nsDNA, nonspecific DNA.


Tab. 2 Default species amounts for Designer-generated systems.

Species type               Initial amount   Split on cell division
mRNA                       0                Y
Promoters and operators    1                N
Proteins                   0                Y
Ribosome                   600              N
RNAp                       300              N
nsDNA                      5 000 000        N
Other                      0                N

nsDNA, nonspecific DNA.


Tab. 3 Reaction network of the repressilator example.

Kinetic data      Reaction
Protein multimerization
1 000 000 000     2 LacI → LacI2
0                 LacI2 → 2 LacI
1 000 000 000     2 LacI2 → LacI4
0                 LacI4 → 2 LacI2
1 000 000 000     2 TetR → TetR2
0                 TetR2 → 2 TetR
1 000 000 000     2 AraC → AraC2
0                 AraC2 → 2 AraC
Transcription
0.0166            RNAp + araP + araI1 + araI2 → RNAp:araP:araI1:araI2
0.057             RNAp:araP:araI1:araI2 → RNAp + araP + araI1 + araI2
0.01              RNAp:araP:araI1:araI2 → RNAp:araP:araI1:araI2*
30                RNAp:araP:araI1:araI2* → RNAp:DNA_LacI + araP + araI1 + araI2
30 nt/s, 600 nt   RNAp:DNA_LacI → RNAp + mRNA_LacI
0.0166            RNAp + lacP + lacO1 → RNAp:lacP:lacO1
0.057             RNAp:lacP:lacO1 → RNAp + lacP + lacO1
0.01              RNAp:lacP:lacO1 → RNAp:lacP:lacO1*
30                RNAp:lacP:lacO1* → RNAp:DNA_TetR + lacP + lacO1
30 nt/s, 600 nt   RNAp:DNA_TetR → RNAp + mRNA_TetR
0.0166            RNAp + tetP + tetO2 → RNAp:tetP:tetO2
0.057             RNAp:tetP:tetO2 → RNAp + tetP + tetO2
0.01              RNAp:tetP:tetO2 → RNAp:tetP:tetO2*
30                RNAp:tetP:tetO2* → RNAp:DNA_AraC + tetP + tetO2
30 nt/s, 600 nt   RNAp:DNA_AraC → RNAp + mRNA_AraC
Translation
100 000           rib + mRNA_LacI → rib:mRNA_LacI
33                rib:mRNA_LacI → rib:mRNA_LacI_1 + mRNA_LacI
33 aa/s, 220 aa   rib:mRNA_LacI_1 → rib + LacI
100 000           rib + mRNA_TetR → rib:mRNA_TetR
33                rib:mRNA_TetR → rib:mRNA_TetR_1 + mRNA_TetR
33 aa/s, 220 aa   rib:mRNA_TetR_1 → rib + TetR
100 000           rib + mRNA_AraC → rib:mRNA_AraC
33                rib:mRNA_AraC → rib:mRNA_AraC_1 + mRNA_AraC
33 aa/s, 220 aa   rib:mRNA_AraC_1 → rib + AraC
Regulation
1 000 000 000     LacI4 + lacO1 → LacI4:lacO1
0.005             LacI4:lacO1 → LacI4 + lacO1
1 000 000 000     TetR2 + tetO2 → TetR2:tetO2
0.005             TetR2:tetO2 → TetR2 + tetO2
1 000 000 000     AraC2 + araI1 → AraC2:araI1
0.005             AraC2:araI1 → AraC2 + araI1
1 000 000 000     AraC2 + araI2 → AraC2:araI2
0.005             AraC2:araI2 → AraC2 + araI2
Nonspecific DNA interactions
1000              LacI4 + nsDNA → LacI4:nsDNA
1.62              LacI4:nsDNA → LacI4 + nsDNA
1000              TetR2 + nsDNA → TetR2:nsDNA
1.62              TetR2:nsDNA → TetR2 + nsDNA
1000              AraC2 + nsDNA → AraC2:nsDNA
1.62              AraC2:nsDNA → AraC2 + nsDNA
Degradation
0.000289          LacI4 →
0.0015            mRNA_LacI →
0.000193          LacI4:nsDNA → nsDNA
0.000289          TetR2 →
0.0015            mRNA_TetR →
0.000193          TetR2:nsDNA → nsDNA
0.000289          AraC2 →
0.0015            mRNA_AraC →
0.000193          AraC2:nsDNA → nsDNA

*Reactions representing the RNA polymerase holoenzyme complex transitioning from a closed to
an open state; the asterisk indicates the open state.
nsDNA, nonspecific DNA.


Fig. 6 Schematic of a lac-tet-ara repressilator. Dlac, Dtet, and Dara refer to DNA
coding regions corresponding to LacI, TetR, and AraC, respectively.


(araP) controlling a LacI coding region (DNAlac); (ii) a LacI-repressible promoter
(lacP) controlling a TetR coding region (DNAtet); and (iii) a TetR-repressible
promoter (tetP) controlling an AraC coding region (DNAara). There are no effector
molecules or additional proteins in the system. Overall, the lac-tet-ara
“repressilator” features three different fluorescent proteins aiming to quantify the
amount of the three repressor proteins (LacI, TetR, AraC) at any given time [65].


To the present authors’ knowledge, this 
device does not exist in the Parts Registry 
as a BioBrick device. Nevertheless, it is 
fully possible for a user to recreate such 
large and complex networks in Designer 
from scratch. This process is outlined 
below. 


3.2 
Inserting and Modifying User-Defined Parts 


As this example network does not involve BioBricks, it is necessary to add and
specify each part manually. This includes the aforementioned promoters and coding
regions, as well as several RBS and terminators (term). For valid Designer input,
all transcriptional units must include an RBS between each promoter and coding
region, and one or more terminators after each coding region (or series of
coding regions, if polycistronic). Hence, the full list of parts in this system is:
araP-RBS-DNAlac-term-lacP-RBS-DNAtet-term-tetP-RBS-DNAara-term.
Each part is added by entering its name and selecting its type in the “Add Custom
BioBrick” subsection on the first page of Designer’s interface. This is shown in
Fig. 7.
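The ordering rule stated above (promoter, then RBS, then one or more coding regions, then one or more terminators) lends itself to a simple check before parts are entered. The sketch below encodes each part type as a letter and matches the resulting string against a regular expression. The type names, the pattern and the function name are assumptions made for illustration, and Designer's own validation may differ.

```python
import re

# One-letter codes for part types (labels assumed for this sketch).
TYPE_CODES = {"promoter": "P", "rbs": "R", "coding": "C", "terminator": "T"}

# One or more units, each: promoter, RBS, >=1 coding region, >=1 terminator.
UNIT_PATTERN = re.compile(r"(PRC+T+)+")


def is_valid_construct(part_types):
    """Return True if the ordered part types form valid transcriptional units."""
    code = "".join(TYPE_CODES[kind] for kind in part_types)
    return UNIT_PATTERN.fullmatch(code) is not None


# The lac-tet-ara repressilator: three promoter-RBS-coding-terminator units.
repressilator = ["promoter", "rbs", "coding", "terminator"] * 3
print(is_valid_construct(repressilator))                          # True
print(is_valid_construct(["promoter", "coding", "terminator"]))   # False: no RBS
```

A construct that omits an RBS or a terminator fails the check, mirroring the input requirements described in the text.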

Additional information must be specified for each promoter and coding region.
For promoters, the user must introduce operator sites. As described by Tuttle et al.
[65], araP has two operator sites: araI1 between -35 and -10; and araI2
downstream of -10. To add a site, select a promoter’s tab, enter an operator site’s
name and location, and click “Add Operator” as shown in Fig. 8. Other operator sites
in the system include lacO1 between -35 and -10 on lacP, and tetO2 between -35
and -10 on tetP. Please note that all promoters in this system are constitutively
“ON.”

As for coding regions, a protein and its type must be specified for each. In
this example, DNAlac corresponds to LacI, DNAtet corresponds to TetR, and DNAara
corresponds to AraC. All of these proteins are repressors.


3.3 
Protein Input, Specifics, and Generation of 
Output 


There are no constitutively expressed proteins in this system, but protein complex
and DNA-binding information must be added for LacI, TetR, and AraC. Specifically,
LacI forms a tetramer and binds lacO1, TetR forms a dimer and binds tetO2, and
AraC forms a dimer and binds araI1 and araI2 (see Sect. 2.3 for details on how this
information would be input into Designer). The “Current Proteins” table for this
example is shown in Fig. 9 as a summary.

As discussed in the previous example, after specifying protein information, the
next step is to add effectors. However, this example involves no effector molecules,
and thus Designer already has enough information to generate a reaction network
as a NetCDF or SBML file. The general contents of Designer output files are
outlined in Sect. 2.5. The reaction network generated for this specific example is
provided in Table 3.

Fig. 7 Adding a custom part.

Fig. 8 Adding operator sites to a promoter.

Current proteins
Protein   Type        Complex   Binds
LacI      Repressor   LacI4     lacO1
TetR      Repressor   TetR2     tetO2
AraC      Repressor   AraC2     araI1 araI2

Fig. 9 Summary of protein specifics for the repressilator.


4 
Current Limitations of the Designer 


Although SynBioSS Designer greatly simplifies the generation of reaction network
models of synthetic constructs, it currently faces important limitations, three of
which are as follows [48]:

e The first shortcoming of Designer is 
related to the selection of kinetic para- 
meters used in the generated reaction 
networks. In these networks, Designer 
exploits characteristic kinetic parame- 
ters from well-studied systems such as 
the tetracycline and lactose operon. Even 
though these kinetic parameters provide 
a good estimation about the kinetics of 
the studied systems, they are not always 
representative and may yield imprecise 
predictions. A continuing goal is to im- 
prove the connection between SynBioSS 
Wiki and Designer so that the latter au- 
tomatically retrieves information from 
the former and assigns them to the de- 
veloped reaction networks. 

• A second obstacle stems from the assumption
that all the required elements
are available and correctly annotated
in the Parts Registry. This assumption
does not always hold, as several
BioBricks do not correctly
identify specific components, such as
promoter regions and RBSs. In this case,
Designer returns an error, and the user
must then enter these components and
designations manually. In other words,
the first output of Designer relies on the
integrity of BioBrick information in the
Registry, although the user can always
add, delete, or modify BioBrick parts
manually through the Designer interface.

• The third limitation of Designer
is that it strictly follows the
central dogma of molecular biology. In
other words, it only accounts for the
processes involved in DNA replication,
DNA transcription and RNA translation.
Even though these processes govern
gene expression dynamics, secondary
mechanisms that also affect gene
expression, such as RNA silencing and
antisense regulation, are currently
neglected. Continued development
may incorporate these additional molecular
biology mechanisms in the toolbox
of SynBioSS.
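
To make concrete what a reaction network restricted to the central dogma looks like, the following minimal Python sketch integrates the two textbook rate equations for transcription and translation with first-order degradation. It is an illustration only and is not Designer output: the function names and all rate constants are placeholder assumptions, and DNA replication, which Designer also accounts for, is omitted for brevity.

```python
def simulate(k_tx=0.5, k_tl=2.0, d_m=0.1, d_p=0.02, dna=1.0,
             t_end=1000.0, dt=0.01):
    """Forward-Euler integration of a two-species central-dogma model.

    All rate constants are hypothetical placeholders (arbitrary units):
      k_tx: transcription rate per DNA copy
      k_tl: translation rate per mRNA
      d_m, d_p: first-order degradation rates of mRNA and protein
    """
    mrna, protein = 0.0, 0.0
    for _ in range(int(t_end / dt)):
        d_mrna = k_tx * dna - d_m * mrna          # transcription - mRNA decay
        d_protein = k_tl * mrna - d_p * protein   # translation - protein decay
        mrna += d_mrna * dt
        protein += d_protein * dt
    return mrna, protein


def steady_state(k_tx=0.5, k_tl=2.0, d_m=0.1, d_p=0.02, dna=1.0):
    """Analytical steady state of the same model, for comparison."""
    mrna = k_tx * dna / d_m
    protein = k_tl * mrna / d_p
    return mrna, protein


if __name__ == "__main__":
    print("simulated :", simulate())
    print("analytical:", steady_state())
```

In a model of this kind, the predicted protein level depends directly on the chosen rate constants, which is why generic parameters can give imprecise predictions. Mechanisms such as RNA silencing or antisense regulation would require additional species and terms that this sketch, like the network it imitates, does not contain.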


5 
Conclusions 


Modeling may accelerate the design of syn- 
thetic biological systems by circumventing 
costly and time-consuming experimental 
trial-and-error approaches. As a result, the 
field of synthetic biology has witnessed an 
increasing interest in computational tools, 
which model and simulate the behavior of 
synthetic systems. 

Numerous efficient computational 
tools with multiple functions have been 
launched during the past few years. One 
in particular, the SynBioSS Designer, is a
relatively new software package where the 
user simply inputs the molecular parts 
of a synthetic system and the software 
outputs a reaction network modeling 
the system. What differentiates Designer 
from many other computational tools is its 
connection with the BioBricks database. 
By using SynBioSS Designer, the user 
can pick any molecular component from 
the Registry of Parts, the BioBricks 
database. Using BioBricks renders the 
experimental design of robust synthetic 
systems more accessible, as BioBricks 
are well-characterized molecular parts 
and their function is well established. 
In addition, the behavior of integrated 
synthetic systems built from various
BioBricks may be predicted in silico with
greater accuracy, as there is plenty of
information available concerning the 
kinetics of these parts. 

In summary, SynBioSS Designer can 
automatically generate reaction networks 
from BioBricks in a plug-and-play fashion. 
Even though its use currently has some
limitations, improved versions of
Designer may assist in the in silico design of
synthetic systems.
Synthetic Gene Circuits


Keywords


Analog circuit: 
A circuit that produces a graded response based on the levels of inputs it receives. Analog
circuits can have a graded transfer function over a wide range of input concentrations. 


Digital circuit: 

A circuit that produces an all-or-none response, depending on whether its inputs are 
above or below a defined threshold level. Digital circuits include digital logic gates. 
Complex circuits may include layered logic gates, so that the output of one gate provides 
the input for another. 


Logic gate:
A device that accepts inputs in the form of TRUE and FALSE (also represented as 1
and 0, respectively) and returns a single TRUE or FALSE output based on Boolean logic
operations such as AND, NOT, OR, and others. For example, an AND gate returns the
TRUE output if, and only if, all inputs are TRUE.


Orthogonality: 

The ability of circuit components to function in the same cell without crosstalk; for 
example, two transcription factors that bind distinct DNA motifs, or RNA molecules 
that regulate distinct transcripts. Orthogonal components are vital for building complex 
synthetic circuits. 


Oscillator: 
A circuit that cycles repeatedly between states, such as high and low levels of expression 
of a particular protein; one of the prototypical examples of a synthetic gene circuit. 


Synthetic transcription factor (sTF): 

A human-made transcription factor designed to regulate transcription from a specific 
promoter; often designed to be responsive to an input, such as a small molecule. May 
include domains from naturally occurring transcription factors, such as a transactivation 
or ligand-binding domain, and rationally designed motifs, such as sequence-specific 
DNA-binding zinc fingers. 


Toggle switch: 

A circuit that can exist in one of two stable states and may be switched (toggled) between 
the two states by a defined input; along with the oscillator, the toggle switch is a classic 
example of a proof-of-principle synthetic gene circuit. 


Transfer function: 

The output level of a gene circuit as a function of the input level(s) (e.g., the activity of 
the fluorescent reporter output as a function of the concentration of a small molecule 
input). 


Tunability: 
The ability to adjust the activity level of a synthetic circuit component, such as the 
strength of gene expression from a synthetic promoter. 
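
The difference between an analog and a digital transfer function can be illustrated numerically. The short Python sketch below uses a Hill function, a common phenomenological choice that is an assumption here rather than part of the definitions above; the function name and the parameters k, n, and y_max are hypothetical and carry arbitrary units.

```python
def hill_transfer(x, k=1.0, n=1.0, y_max=1.0):
    """Output level of a circuit as a function of input level x.

    k is the input level giving half-maximal output and n sets the
    steepness: a small n gives a graded (analog-like) response, a
    large n gives a near all-or-none (digital-like) response.
    """
    return y_max * x**n / (k**n + x**n)


if __name__ == "__main__":
    for x in (0.25, 0.5, 1.0, 2.0, 4.0):
        graded = hill_transfer(x, n=1.0)
        switch_like = hill_transfer(x, n=8.0)
        print(f"input={x:5.2f}  n=1: {graded:.3f}  n=8: {switch_like:.3f}")
```

With n = 1 the output changes gradually over a wide range of inputs, whereas with n = 8 it jumps from nearly zero to nearly maximal around the threshold k. Changing k or y_max in such a model is one simple way to picture tunability.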


The past decade has witnessed tremendous advances in the design and imple- 
mentation of synthetic gene circuits that program living cells to perform specific 
user-defined tasks. Synthetic circuits have been implemented in bacteria, yeast, 
and mammalian cells, using a variety of transcriptional and post-transcriptional 
regulatory mechanisms. These devices, which lie at the intersection of biology 
and engineering, have provided insights into the function of naturally occurring 
gene regulatory networks. Furthermore, they hold the potential for transformative
future applications in medicine, bioremediation, manufacturing, and more. In 
this review, some examples are presented of commonly used synthetic circuits, 
including oscillators, switches, memory devices, and circuits that perform digital 
and analog computation. The building blocks of synthetic gene circuits, as well as 
the challenges and considerations of circuit design, are also discussed. Finally, an 
overview is provided of the potential practical applications of this dynamic field of
research.


1 
Introduction to Synthetic Gene Circuits 


Living cells monitor their environment 
and respond to a variety of inputs with so- 
phisticated behaviors, including changes 
in gene expression, cell morphology and 
motility, regulation of the cell cycle and 
growth, and protein secretion. The genetic 
circuits that control these behaviors have 
been selected over evolutionary time to 
enhance the fitness of the cell (or mul- 
ticellular organism). The emerging and 
rapidly expanding discipline of synthetic 
biology aims to engineer synthetic ge- 
netic circuits that perform user-defined 
functions in a predictable and reliable 
manner. 

The synthetic biology approach is mod- 
eled explicitly after other engineering dis- 
ciplines, such as electrical and mechanical 
engineering. Like other forms of engineer- 
ing, synthetic biology aims to adopt differ- 
ent levels of abstraction for design, from 
a very general, high-level description of 
desired circuit behavior (e.g., “design a cir- 
cuit that recognizes and kills cancer cells”), 
to a detailed description of the molecular 
mechanisms used to implement the de- 
sired behavior, including the sequences of 
all the DNA-based components. 

The motivations for building synthetic 
gene circuits include both discovery and 
applications. Early examples of synthetic 
circuit design include relatively small and 


simple genetic devices that were used 
to study the principles underlying the 
behavior of gene regulatory networks (for 
reviews, see Refs [1-4]). More recent 
studies have rewired the native regulatory 
pathways of living cells in order to study 
the design principles of gene networks in 
a “learn by design” approach [5], as well as 
to gain insights into complicated processes 
such as malignant transformation [6]. 
Meanwhile, many synthetic biologists are 
aiming to build circuits with possible 
practical applications in areas such as 
medicine, environmental bioremediation, 
the manufacture of biofuels and valuable 
chemicals, and biological computation 
[1, 7].

The review begins with a summary
of the building blocks of synthetic gene
circuits (Sect. 2). Sections 3-7 present
a number of commonly used and
studied types of synthetic gene circuit, including
oscillators, toggles and cascades
(Sect. 3); memory devices (Sect. 4); digital
and analog circuits (Sects. 5 and 6);
and multicellular systems (Sect. 7). While
synthetic biology has progressed greatly 
during the past few years, significant chal- 
lenges remain, and these are discussed 
briefly in Sect. 8. The review concludes
with a survey of applications of synthetic
circuits (Sect. 9), which provides an overview of
the potential for tremendous advances
in technology and medicine offered by this
burgeoning field of research.


2 
Building Blocks of Synthetic Gene Circuits 


A genetic circuit often consists of three 
parts: (i) a sensor, which accepts an input 
or inputs; (ii) a processor, which com- 
putes the desired response to the input(s); 
and (iii) an actuator, which produces the 
corresponding output. The function of a 
synthetic gene circuit depends on its build- 
ing blocks (synthetic DNA, RNA, and pro- 
teins), as well as the way these components 
are wired together into sensor, processor, 
and actuator modules. The building blocks 
of synthetic gene circuits may be rationally 
designed or harvested from Nature, some- 
times accompanied by directed evolution 
to alter their performance in a desired way. 
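
The sensor, processor, and actuator division can be pictured as three composed functions. The Python sketch below is only a schematic analogy under stated assumptions: the input names, threshold values, the choice of an AND gate as the processor, and the reporter output are all hypothetical.

```python
def sensor(concentrations, thresholds):
    """Sensor: convert input concentrations into TRUE/FALSE signals."""
    return {name: concentrations.get(name, 0.0) >= level
            for name, level in thresholds.items()}


def processor(signals):
    """Processor: compute the response; here, a multi-input AND gate."""
    return all(signals.values())


def actuator(decision):
    """Actuator: produce the output; here, a reporter is switched on or off."""
    return "reporter ON" if decision else "reporter OFF"


def circuit(concentrations, thresholds):
    """Wire the three modules together in sequence."""
    return actuator(processor(sensor(concentrations, thresholds)))


# Hypothetical threshold levels for two inputs (arbitrary units).
THRESHOLDS = {"input_1": 50.0, "input_2": 500.0}

if __name__ == "__main__":
    print(circuit({"input_1": 100.0, "input_2": 1000.0}, THRESHOLDS))
    print(circuit({"input_1": 100.0, "input_2": 10.0}, THRESHOLDS))
```

Swapping the body of processor for a different Boolean or graded function changes the circuit's behavior without touching the sensor or actuator, which is the sense in which the modules are wired together.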

In this section, the choice of host cells
for implementing synthetic circuits (the
chassis) and of circuit inputs and outputs is
discussed, as well as the molecular implementation
of the circuits themselves.
Attention is focused mainly on transcrip- 
tional control, which has been used exten- 
sively in synthetic gene circuits; RNA- and 
protein-based approaches are also briefly 
discussed. Considerations of circuit topol- 
ogy are provided in Sect. 8. 


2.1 
The Chassis: Choice of the Host Cell 


Almost all synthetic gene circuits devel- 
oped to date have been implemented in 
the bacterium Escherichia coli, the budding 
yeast Saccharomyces cerevisiae, or mam- 
malian cells. Each of these hosts presents 
a unique set of advantages and challenges, 
as well as distinct host-circuit interactions
that must be considered when designing 
circuits for operation in living cells. 

Many of the earliest synthetic circuits 
were implemented in the model bacterium 
E. coli [8, 9] (see Sect. 3). The advantages of 
E. coli include its relatively small genome,
extensive toolbox for genetic manipula- 
tion, rapid and easy growth characteristics, 
and simple and well-understood manner 
of transcriptional regulation. In addition 
to providing a fertile testing ground for 
proof-of-principle genetic circuits, E. coli 
is also an interesting organism for practi- 
cal applications in bioremediation; for the 
manufacture of biofuels, pharmaceuticals, 
and other valuable chemicals; and for hu- 
man health (e.g., identifying mechanisms 
to combat antibiotic-resistant pathogenic 
bacteria, or engineering bacteria to find 
and destroy cancer cells). 

The budding yeast S. cerevisiae presents 
an excellent model system for designing 
synthetic circuits in eukaryotes (for a re- 
view, see Ref. [10]). Like E. coli, S. cerevisiae 
is quick and easy to grow in the labora- 
tory, and it offers a well-developed suite 
of genetic tools, including the ability to 
maintain foreign genetic elements stably 
on plasmids and to achieve an efficient ho- 
mologous recombination of synthetic con- 
structs into the genome. At the same time, 
S. cerevisiae offers a variety of useful char- 
acteristics for designing logic circuits that 
bacteria lack, including intracellular com- 
partmentalization and a rich regulatory 
repertoire with complex transcriptional 
regulation and protein signaling cascades. 
Therefore, yeast can be thought of as a 
“testing platform” for synthetic biology 
approaches that can then be adapted to 
more complicated mammalian cells [11]. 
S. cerevisiae is also a workhorse of indus- 
trial synthetic biology, and there is great 
interest in programming yeast strains for 
improved production of biofuels and com- 
modity chemicals [12]. 

Mammalian cells are highly desirable 
targets for synthetic circuit engineering, 
due to myriad possible applications such
as therapeutics and tissue engineering
[13]. However, these cells have highly
complex functions, gene regulation, and 
intercellular interactions. Furthermore, it 
is more difficult to work with mam- 
malian cells compared to microorganisms 
in many technical aspects, such as culture 
maintenance, DNA delivery, and experi- 
mental turnover time. Thus, compared to 
yeast and bacteria, mammalian cells pose 
a significantly greater challenge for syn- 
thetic biologists. Nonetheless, the past few 
years have seen many advances in pro- 
gramming circuits in mammalian cells, as 
described below. Popular mammalian cell 
types that are well adapted to laboratory 
conditions include HeLa (human cancer), 
HEK293 (human embryonic kidney), and 
CHO (Chinese hamster ovary) cells. 


2.2 
Inputs and Outputs of Synthetic Circuits 


Many synthetic gene circuits are designed 
to perform a specific action in response to 
a defined input. Commonly used inputs 
during circuit design are convenient small 
molecules that cells can be engineered to 
respond to, including antibiotics such as 
anhydrotetracycline (aTc); metabolites or 
their analogs, such as arabinose or isopropyl
β-D-thiogalactopyranoside (IPTG),
which mimics a lactose metabolism in- 
termediate; and acyl homoserine lactones 
(AHLs), diffusible molecules that bacte- 
ria use to communicate with each other 
(see Sect. 7.1) [14-16]. In Nature, bacte- 
ria respond to such inputs by activating 
or repressing target gene expression; for 
example, arabinose triggers the transcrip- 
tion of genes that encode enzymes and 
transporters needed to utilize this sugar 
[17]. Synthetic biology rewires the cell’s 
response to these small-molecule inputs. 
Cells can also trigger gene expression in 
response to an external stressor, such as a 


pulse of heat or DNA-damaging ultraviolet 
radiation, and these stimuli have been 
used as inputs to synthetic gene circuits 
[9, 18]. Their disadvantage, however, is that 
they can stimulate wide-ranging responses 
in the host cell that may interfere with de- 
sired function of the circuit, and prolonged 
or repeated exposure may kill the host cell. 

More recently, various research groups 
have harnessed specific wavelengths of 
light as inputs into synthetic gene circuits, 
for example, by using light-sensitive 
proteins from photosynthetic or 
light-sensitive organisms, including 
cyanobacteria and plants [19, 20]. One of 
the advantages of light over a diffusible 
small molecule is the exquisite level of 
spatiotemporal control that it offers; it is 
possible to shine a light specifically onto 
one part of a plate covered with engineered 
cells. Moreover, light can be switched on 
and off instantly, unlike a small molecule 
which, once applied, can only be removed 
through washing or gradual dilution. The 
field of controlling cell behavior with 
light, named optogenetics, has undergone 
explosive growth during the past few years 
(for a review, see Ref. [21]). 

While many proof-of-principle synthetic 
circuits use exogenous small molecules 
or other inputs that are easy to measure 
and apply, synthetic biology also aims to 
create circuits that respond to endogenous 
and functionally relevant inputs, such 
as disease markers [22-24]. Designing
a circuit to respond to such inputs is
challenging, particularly because of the
difficulty of finding or designing relevant
sensors [13, 22-24]. However, a number of
circuits responsive to endogenous inputs 
have been constructed (see Sect. 9). 

The outputs of synthetic gene circuits 
are frequently fluorescent reporter 
proteins, because such outputs are 
easy to detect and quantify, enabling 


characterization of circuit performance 
and dynamics. Flow cytometry allows 
several different fluorescent species to be 
quantified per cell in a high-throughput 
manner, and time-lapse microscopy 
allows the observation of changes in 
output levels in response to defined 
inputs in single cells over time. Hence, 
fluorescent proteins are frequently used 
in proof-of-principle demonstrations 
as well as for troubleshooting circuits; 
subsequently, once the circuit displays a 
desired performance based on fluores- 
cence assays, the reporter genes may be 
replaced by output genes that perform a 
desired function. 

Future circuits will increasingly aim to 
produce outputs relevant for industrial or 
therapeutic applications, such as control 
over the cell’s proliferation, survival, differ- 
entiation, morphogenesis, migration, or 
synthesis and secretion of a therapeutic 
protein or commodity chemical. Examples 
of circuits with functionally relevant out- 
puts are described in Sect. 9. 


2.3
Properties of Synthetic Building Blocks 


The function of a synthetic circuit depends
upon the function of its components (for 
reviews, see Refs [1] and [2]). Synthetic 
building blocks, such as transcription 
factors (TFs) or RNA regulatory elements, 
should ideally be: 


• Modular: The component should have a
defined function that persists regardless
of context; for example, a promoter
should drive the expression of a downstream
gene at the same level, regardless of
the identity of the downstream gene.
Modularity is the property that allows
basic building blocks to be assembled
into complex devices in a predictable
manner.

• Composable: It should be possible to
arrange the components together into a
functional circuit, so that the output of one
part of the circuit can serve as the input
for a downstream part.

• Orthogonal: The components should
avoid unwanted crosstalk with each
other, or with the host cell’s molecular
machinery. Unlike an electronic device,
in which components are connected by
physical wires, a gene circuit consists
largely of building blocks that freely diffuse
and mix inside the cell, creating
the potential for unwanted interactions.
For example, if a circuit includes the
TFs A and B that regulate promoters
P_A and P_B, respectively, then the unwanted
regulation of P_A by B may lead
to circuit failure. Likewise, unplanned
interactions between a synthetic factor
and the cell’s native genes may disrupt
circuit function or harm the host cell.

• Tunable: Different circuit functions require
components with different levels
of activity, such as promoters that drive
expression of downstream genes at different
levels. Moreover, successful circuit
design requires level matching between
upstream and downstream parts
of the circuit (see Sect. 8.1). Hence, synthetic
biology requires components that
display tunability, that is, the ability to achieve
different output levels through small
changes, such as point mutagenesis of
a promoter or an RNA regulatory device.
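
The orthogonality requirement can be expressed as a simple check on a table of measured interactions. In the Python sketch below, the TF and promoter names follow the A/B example in the list above, while the function name, the regulation strengths, and the tolerance are hypothetical values chosen purely for illustration.

```python
def find_crosstalk(regulation, intended, tolerance=0.1):
    """Return (TF, promoter) pairs that are not intended targets yet
    show a regulation strength above the tolerance."""
    return sorted(pair for pair, strength in regulation.items()
                  if pair not in intended and abs(strength) > tolerance)


# Hypothetical relative regulation strengths (0 = none, 1 = full effect).
regulation = {
    ("A", "P_A"): 0.95,
    ("A", "P_B"): 0.02,
    ("B", "P_A"): 0.40,   # unwanted regulation of P_A by B
    ("B", "P_B"): 0.90,
}
intended = {("A", "P_A"), ("B", "P_B")}

if __name__ == "__main__":
    print(find_crosstalk(regulation, intended))
```

A set of parts would be considered orthogonal at the chosen tolerance only if this check returns an empty list; a complete analysis would also include interactions with the host cell's own genes, which are not modeled here.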


Orthogonality and tunability require the 
construction of large libraries of each type 
of circuit component, such as promoters, 
TFs, ribosome-binding sites (RBSs), regu- 
latory RNA molecules, and others. Efforts 
have been made to construct libraries of 
orthogonal parts with reduced unwanted 
crosstalk [25-27]. Libraries of building
blocks may be obtained through a com- 
bination of three approaches [28]: (i) parts 
mining, which involves searching the an- 
notated genomes of different species for 
components that are predicted to have the 
desired function while avoiding crosstalk 
with the host cell; (ii) directed evolution, 
whereby a library of mutagenized com- 
ponents is constructed and subjected to 
selection in vitro or in vivo in order to 
identify components with desired activity; 
and (iii) rational design, which is appli- 
cable to the types of device that operate 
according to well-understood rules. For 
example, synthetic transcription factors 
(sTFs) based on transcription activator-like 
effectors (TALEs) offer a way to obtain or- 
thogonal TFs by rational design [29] (see 
below). 

It is possible to use a combination of 
rational design and directed evolution to 
achieve tunability and to produce libraries 
of components with the same function, but 
with different activity levels. For example, 
the expression level of a protein depends 
partly on the sequence of the RBS on 
its corresponding mRNA; hence, protein 
expression levels can be adjusted through 
point mutations in the RBS. The RBS 
calculator [30] is a program for rationally 
designing an RBS sequence that results 
in a user-defined protein expression level. 
Similarly, promoters can be tuned via 
point mutations [31]. In addition, some 
circuits are designed to be tunable in vivo 
through the addition of small-molecule 
inputs. For example, a bandpass filter has 
been designed that allows its host cells to 
survive only at a specific concentration of 
two antibiotics, ampicillin and tetracycline 
[32]. Varying the levels of the inducer 
IPTG changes the target concentration of 
ampicillin and tetracycline at which cells 
can survive by adjusting the expression of 


the bandpass filter’s components [32] (see 
Sect. 3.4). 
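
A band-pass response that can be shifted by retuning its components may be sketched with a generic model. The Python example below multiplies a rising and a falling Hill term; this is a common textbook abstraction and is an assumption here, not the mechanism of the circuit in Ref. [32]. The function name, parameters, and values are hypothetical.

```python
def band_pass(x, k_low=1.0, k_high=100.0, n=2.0):
    """Generic band-pass transfer function (arbitrary units).

    The output is high only when the input x lies between the lower
    threshold k_low and the upper threshold k_high.
    """
    rise = x**n / (k_low**n + x**n)            # switches on above k_low
    fall = k_high**n / (k_high**n + x**n)      # switches off above k_high
    return rise * fall


if __name__ == "__main__":
    for x in (0.1, 1.0, 10.0, 100.0, 1000.0):
        default = band_pass(x)
        shifted = band_pass(x, k_low=10.0, k_high=1000.0)
        print(f"input={x:7.1f}  default={default:.3f}  shifted={shifted:.3f}")
```

In this model the response peaks at the geometric mean of the two thresholds, so changing k_low and k_high moves the window of inputs that gives a high output. This is loosely analogous to using an inducer to adjust the expression of the filter's components.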

Below are described several of the 
most commonly used families of synthetic 
circuit components (also see Table 1 in 
Ref. [1]). Some examples of commonly 
used synthetic circuit building blocks are 
shown in Fig. 1. 


2.4 
Building Blocks of Synthetic Transcriptional 
Regulation 


Transcription is a key mechanism for 
gene regulation in living cells. In its 
most basic form, a transcriptional unit 
consists of a TF and its cognate pro- 
moter/regulator. The TF binds its regula- 
tor region in a sequence-specific manner 
and activates or represses its transcrip- 
tion, in some cases in response to an input 
such as a small molecule. The modular 
nature of eukaryotic TFs enables the con- 
struction of sTFs by combining different 
DNA-binding, ligand-binding, and regula- 
tory domains [26, 31, 33-36]. Some tran- 
scriptional regulatory domains, such as 
the commonly used viral VP16 activation 
domain or its derivative, the VP64 domain 
[37], activate gene expression from their 
target promoter. In this case, the cognate 
synthetic promoter is either a very weak or 
a minimal promoter located downstream 
of the TF recognition site, and transcrip- 
tion is initiated from this promoter in the 
presence of the TF. Other transcriptional 
regulatory domains inhibit gene expression,
such as the Krüppel-associated box
(KRAB) domain found in vertebrates [38], 
which inhibits transcription by recruit- 
ing chromatin-modifying proteins that 
cause formation of repressive heterochro- 
matin. In this case, the cognate synthetic 
promoter is usually a strong constitutive
promoter with associated TF recognition
sites that is active by default until silenced
by the TF. Depending on its mechanism
of action and on where in the gene it binds,
the same sTF may either activate or
repress transcription of a target gene
[43].

Fig. 1 Examples of commonly used synthetic
building blocks. (a) Transcriptional regulation
is the backbone of many synthetic circuits (see
Sect. 2.4). Top: The bacterial TetR transcriptional
repressor forms homodimers that bind
tightly to their DNA target motifs, sterically
blocking RNA polymerase recruitment to the
PtetO promoter. Binding of the small molecule
anhydrotetracycline (aTc) causes TetR to dissociate
from DNA, allowing expression of the
downstream gene of interest (goi). TetR-based
transgene regulation is commonly used in E.
coli and has also been adapted for use in eukaryotes.
Bottom: Transcription Activator-Like
Effectors (TALEs) bind to user-defined DNA
sequences. Here, a synthetic TALE-TF is used
to activate a goi in mammalian cells in response
to blue light [39]. Upon illumination,
the light-sensitive CRY2 domain fused to the
TALE-TF undergoes a conformational change,
resulting in recruitment of its partner CIB1
domain along with a transcriptional activator
domain (AD) that stimulates target gene expression.
Lower panel adapted from Ref. [39];
(b) RNA-based devices can regulate target
gene expression at different levels, including
transcriptional elongation, translation initiation,
and mRNA stability (see Sect. 2.5). In
this example, an aptamer (ligand-binding RNA
molecule) is fused to an mRNA encoding a
transgene of interest. In the absence of ligand
(top), the aptamer folds into a stem-loop
structure that blocks the ribosome binding site
(RBS), preventing mRNA translation. The ligand
(L) causes a conformational change in
the aptamer, exposing the RBS and allowing
translation (bottom). Adapted from Ref.
[40]; (c) Regulated protein degradation can be
used to control synthetic circuit activity at a
post-translational level (see Sect. 2.7). Here,
an ssrA tag fused to the protein of interest
(Poi) targets the protein for degradation by
the ClpXP protease. Adapted from Ref. [41];
(d) Synthetic fusion proteins can “rewire”
intracellular signaling cascades. Here, a scaffold
protein brings together three protein
kinases (A, B, and C), resulting in efficient
signal propagation in an S. cerevisiae Mitogen
Activated Protein Kinase (MAPK) pathway.
This naturally occurring system is “rewired”
by fusing the scaffold to a leucine zipper domain,
which recruits a protein phosphatase
(Phos) to the signaling complex [42]. The
phosphatase attenuates MAPK signaling by
dephosphorylating and hence inactivating the
MAP kinase. In contrast, the recruitment of an
activating protein to the scaffold potentiates
MAPK signaling (not shown) [42]. Adapted
from Ref. [42].

In bacteria, transcriptional repressors 
downregulate gene expression at close 
range, via directly interfering with RNA 
polymerase (RNAP) or activator bind- 
ing [44]. Transcriptional repression in 
S. cerevisiae also tends to occur over 
short distances due to the compact na- 
ture of the genome, and may involve 
interference with the recruitment of 
basal transcriptional machinery as well 
as chromatin-mediated silencing [45, 46]. 
In contrast, mammalian repressors of- 
ten involve chromatin remodeling and 
can have long-range effects of up to tens 
of kilobases both upstream and down- 
stream of their binding sites [47]. Thus, 
transgenes inserted into the mammalian 
genome may become silenced by repres- 
sive chromatin spreading from adjacent 
loci, a phenomenon that may be countered 
by flanking the transgene with insulator 
sequences [48]. Furthermore, chromatin 
remodeling is time-consuming, resulting 
in slow kinetics of transcriptional regula- 
tion in mammalian cells [49]. 

As mentioned in Sect. 2.3, one of 
the key requirements of good synthetic 
building blocks is orthogonality: the ability 
of multiple elements to exist in a single cell 
without crosstalk. The regulatory domain 
can be the same among multiple sTFs 
operating in parallel without impairing 
orthogonality. However, off-target binding 
of the sTF to DNA might result in an 
unwanted deregulation of native genes and 
shunting of the sTF away from its target 
promoter. Thus, the DNA-binding domain 
of an sTF must be highly specific and 
target DNA sequences that are very rare or 


absent in the host genome. This is a greater 
challenge for mammalian cells than for 
bacteria and yeast, due to the difference in 
genome sizes (3 × 10^9 bp in humans [50]
compared to ~5 × 10^6 bp in E. coli [51] and
~12.5 × 10^6 bp in S. cerevisiae [52]).
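
The effect of genome size on target-site specificity can be estimated with a back-of-the-envelope calculation. The Python sketch below assumes a random genome with equal base frequencies and counts both DNA strands; these are simplifying assumptions (real genomes contain repeats and biased base composition), and the function names are hypothetical. The genome sizes are the approximate values given in the text.

```python
# Approximate genome sizes in base pairs, as quoted in the text.
GENOME_BP = {
    "human": 3_000_000_000,
    "E. coli": 5_000_000,
    "S. cerevisiae": 12_500_000,
}


def expected_random_sites(genome_bp, site_bp, both_strands=True):
    """Expected number of chance occurrences of one specific site of
    length site_bp in a random genome with equal base frequencies."""
    strands = 2 if both_strands else 1
    return strands * genome_bp / 4 ** site_bp


def min_unique_site_length(genome_bp, both_strands=True):
    """Shortest site length for which fewer than one chance
    occurrence is expected in the genome."""
    n = 1
    while expected_random_sites(genome_bp, n, both_strands) >= 1.0:
        n += 1
    return n


if __name__ == "__main__":
    for organism, size in GENOME_BP.items():
        print(organism, min_unique_site_length(size))
```

Under these assumptions, a recognition site must be about 17 bp long before fewer than one chance occurrence is expected in the human genome, compared with 12 to 13 bp for the two microbial genomes.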

The development of sTFs has oc- 
curred in stages, with each new gen- 
eration of sTFs providing new ways 
to improve orthogonality and tunability. 
The “first-generation” synthetic transcrip- 
tional circuits repurposed TFs that occur 
naturally in bacteria or in phages (viruses 
that infect bacteria). Examples include 
the LacI, TetR, and λ CI TF-promoter
pairs [8, 9], which were initially used to 
build synthetic circuits in bacteria and 
later optimized for use in eukaryotic cells 
(Fig. 1a). Transcriptional repression by 
LacI and TetR is relieved by binding to 
the small molecules IPTG and aTc, re- 
spectively; hence, these TFs can be used 
to build synthetic gene circuits that re- 
spond to small-molecule inputs [8, 9]. 
Early experiments in mammalian cells uti- 
lized the DNA-binding domain of the S. 
cerevisiae GAL4 TF, which recognizes an 
upstream activating sequence (UAS) [53]. 
The GAL4 DNA-binding domain fused to 
the VP16 transcription activation domain 
(VP16AD) can activate gene expression 
from a UAS-containing cognate target pro- 
moter [37]. This useful system exhibits 
tight regulation and a wide dynamic range. 

In order to control the level of tran- 
scription output in response to a small- 
molecule input, ligand-inducible systems 
were further developed in mammalian 
cells. The most-characterized system 
is based on the bacterial tetracycline- 
dependent transactivator (tTA) that, in 
its native form, binds its target DNA 
motif in the absence of tetracycline. Thus, 
a tTA-VP16AD fusion was utilized for 
the construction of a tetracycline-off 


system, which activates its target gene 
in the absence of tetracycline [38]. Later,
tTA was mutated to generate a reverse 
tetracycline-dependent transactivator 
version (rtTA), which binds its target site 
only in the presence of the tetracycline 
analog doxycycline (Dox), constituting 
a tetracycline-on system when fused to 
VP16AD [54]. 
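
The opposite logic of the tetracycline-off and tetracycline-on systems can be summarized as a truth table. The Python sketch below is an idealized Boolean abstraction of the behavior described above; it ignores leaky expression and dose dependence, and the function names are hypothetical.

```python
def tet_off_expression(ligand_present):
    """tTA-based system: the transactivator binds its target site, and
    the target gene is expressed, only when the ligand is absent."""
    return not ligand_present


def tet_on_expression(ligand_present):
    """rtTA-based system: the transactivator binds its target site, and
    the target gene is expressed, only when the ligand is present."""
    return ligand_present


if __name__ == "__main__":
    print("ligand  tet-off  tet-on")
    for ligand in (False, True):
        print(f"{ligand!s:6}  {tet_off_expression(ligand)!s:7}  "
              f"{tet_on_expression(ligand)!s}")
```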

An alternative ligand-responsive tran- 
scriptional regulatory system was based on 
mammalian nuclear hormone receptors, 
TFs that regulate target genes in response 
to steroid hormones such as estrogen and 
progesterone, which diffuse inside the tar- 
get cell and bind to the hormone receptor’s 
ligand-binding domain (LBD), enabling 
regulation of the target gene [55]. The 
simplicity of such systems, which do not
require an elaborate signal-transduction
pathway inside the host cell, makes LBDs attractive
for engineering ligand-responsive
sTFs by fusing LBDs to DNA-binding and 
regulatory domains of choice for use in 
mammalian cells [56-58] and in S. cere- 
visiae [59]. To avoid crosstalk with the 
mammalian host cell’s native hormone 
signaling, LBDs have been engineered 
to respond specifically to synthetic com- 
pounds, such as the progesterone an- 
tagonist RU486 [57]. Alternatively, LBDs 
responsive to insect hormones such as 
ecdysone may be used [56]. Additional 
ligand-dependent mammalian sTFs have 
been developed to respond to a variety of 
ligands such as macrolides [60] and strep- 
togramins [61], L-arginine [62], biotin [63], 
urate [24], and others [64]. 

Although well-characterized TFs based 
on naturally occurring proteins such 
as bacterial TetR or yeast GAL4 have 
many advantages, generating a library 
of sTFs sufficient for building com- 
plex gene networks requires another ap- 
proach. The “second-generation” sTFs 
have custom-designed DNA-binding domains
that can bind any user-defined DNA
sequence with high specificity. The first 
attempt to develop such sTFs focused on 
engineering the DNA-binding domain of 
zinc finger transcription factors (ZF-TFs) 
[33, 65, 66]. A zinc finger is a small protein 
domain of about 30 amino acids that rec- 
ognizes a specific 3 bp DNA sequence [67]. 
A synthetic protein that includes a tandem 
array of zinc fingers will specifically bind 
to a user-defined sequence in the genome 
[68], circumventing the need to rely on the 
small number of DNA-binding domains 
found in naturally occurring TFs. Pioneer- 
ing studies demonstrated that synthetic 
ZF-TFs can regulate endogenous human 
genes in a sequence-specific manner [69]. 
More recently, suites of orthogonal ZF-TFs 
have been constructed for use in synthetic 
gene circuits in mammalian cells [70] and 
in S. cerevisiae [26]. The design of custom 
ZF-TFs was facilitated by publicly avail- 
able online tools such as the Zinc Finger 
Targeter (ZiFiT) that identifies potential 
ZF-binding sites in user-supplied DNA 
sequences [71], as well as the publicly 
available Zinc Finger Database (ZiFDB) 
that provides information on functional 
zinc fingers [72]. Fusing ZF DNA-binding 
arrays to the LBD of a nuclear hormone re- 
ceptor produced ZF-TFs that regulate their 
targets in response to hormone inputs [73, 
74]. 
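
Because each finger reads about 3 bp, a tandem array of n fingers covers roughly 3n bp. As a minimal sketch of that bookkeeping (the function name and the example site are hypothetical, and real designs must also account for interactions between neighbouring fingers), a target site can be split into the 3-bp subsites that the individual fingers would need to recognize:

```python
def split_into_triplets(target_site: str) -> list[str]:
    """Split a DNA target site into consecutive 3-bp subsites, one per zinc finger."""
    site = target_site.upper()
    if set(site) - set("ACGT"):
        raise ValueError("target site must contain only A, C, G, T")
    if len(site) % 3 != 0:
        raise ValueError("target site length must be a multiple of 3 bp")
    return [site[i:i + 3] for i in range(0, len(site), 3)]


# Hypothetical 9-bp site, which would call for a three-finger array.
subsites = split_into_triplets("GCGTGGGCG")
print(len(subsites), subsites)
```

The sketch only partitions the sequence; it says nothing about which finger binds which subsite or how well.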

Although they represented a great tech- 
nical advance, ZF-TFs are not trivial to de- 
sign, and in some cases can lack specificity. 
For example, adjacent zinc fingers can in- 
fluence each other’s binding specificity, 
thus complicating the design of tandem ar- 
rays of these domains [75]. An alternative 
approach utilizes the TALE proteins from 
Xanthomonas spp. bacteria, which are plant 
pathogens that use TALEs to modulate 
their host cell’s gene expression [29, 76]. 
TALEs feature arrays of short amino acid 
repeats, with each repeat specifically bind- 
ing a single DNA base pair. Hence, similar 
to ZF-TFs, synthetic TALEs can be pro- 
grammed to bind a specific DNA sequence 
by combining specific base-pair-binding 
amino acid sequences in the correct or- 
der [29, 77]. Furthermore, the individual 
TALE domains are relatively independent 
of one another, enabling modular as- 
sembly into higher-order arrays to target 
longer DNA sequences. Custom-designed 
TALE-TFs have been used to activate re- 
porter genes in mammalian cells [34, 76]. 
Although TALEs naturally act as transcrip- 
tional activators, targeting TALE-TFs to the 
core promoter region of target genes in or- 
der to block RNAP recruitment can turn 
the TALE-TFs into transcriptional repres- 
sors, as shown in S. cerevisiae [31]. Like 
ZF-TFs, TALE-TFs can also respond to 
hormone-mediated induction when fused 
to LBDs of nuclear hormone receptors [78]. 
Another recent study presented TALE-TFs 
that regulate genes of interest in response 
to blue light (Fig. 1a) [39]. The system con- 
sisted of two parts: a TALE DNA-binding 
domain fused to the Arabidopsis thaliana 
light-sensitive cryptochrome 2 (CRY2) pro- 
tein, and the CRY2 binding partner CIB1 
fused to a transcriptional effector domain, 
such as VP64. When illuminated by blue 
light, CRY2 recruits CIB1, leading to the 
formation of a functional TF that can reg- 
ulate the gene of interest. The system dis- 
plays high specificity and signal-to-noise 
ratio, as well as faster response time 
compared with small-molecule-inducible 
TALE-TFs [39]. 

In addition to using TALE-TFs to reg- 
ulate target genes via regulatory do- 
mains such as VP16, recent studies have 
harnessed TALEs for targeting chromatin 
modifications to endogenous loci of interest. 
For example, TALEs fused to the TET1 
DNA demethylase or the LSD1 histone 
demethylase allow site-specific DNA or hi- 
stone demethylation in mammalian cell 
culture [79, 80]. Zhang and colleagues used 
the light-sensitive TALE system described 
above [39] to target chromatin modifica- 
tions to specific genes in response to light 
by fusing the CIB1 protein to various 
histone-modifying enzymes. This ability 
to modify the host cell’s chromatin at se- 
lected loci has exciting implications for 
future research in areas such as cell fate 
specification and maintenance, as well as 
cancer. 

A recent breakthrough in TF engineer- 
ing came from the bacterial CRISPR 
(Clustered Regularly Interspaced Short 
Palindromic Repeats) system, in which 
sequence specificity can be easily deter- 
mined by the guide RNA (gRNA) se- 
quence rather than protein engineering 
[81]. The CRISPR system serves as a 
form of immune memory in many Bac- 
teria and Archaea, as it causes the cell 
to cleave the DNA of viruses it has previously 
encountered and “remembered” 
by incorporating viral sequences into its 
CRISPR locus [82]. Short RNAs guide a 
Cas (CRISPR-associated) endonuclease to 
cleave DNA that complements the RNA se- 
quence. The results of recent studies have 
shown that the Cas9 protein from Strepto- 
coccus pyogenes and synthetic gRNAs can 
cleave specific DNA sequences when ex- 
pressed in heterologous hosts such as E. 
coli [83] and mammalian cells [84, 85]. The 
CRISPR system presents an attractive, ef- 
ficient method for reprogramming cells by 
genome engineering (e.g., knocking out a 
specific gene of interest) [84, 85]. In addi- 
tion, it was shown that a deactivated Cas9 
protein that lacks endonuclease activity 
(abbreviated dCas9) can be recruited by 
its associated gRNAs to specific DNA loci, 
where it can act as a transcriptional regulator 
without cleaving the target DNA [43, 
81]. Once recruited to the locus of inter- 
est, dCas9 can repress gene expression by 
sterically inhibiting RNAP binding to the 
promoter; alternatively, dCas9 fused to a 
transcriptional coactivator domain such as 
VP16 can activate target gene transcrip- 
tion [43, 86]. The potential advantages 
of the CRISPR system over ZF-TFs and 
TALEs include a greater ease of design, 
as sequence-specific gRNAs are easier to 
design than are zinc fingers or TALE mo- 
tifs, as well as the possibility of building 
a circuit with many orthogonal parts. In 
the ZF-TF and TALE systems, each target 
promoter requires a separate regulatory 
protein, while in the CRISPR system, a 
single dCas9 protein may regulate mul- 
tiple genes by associating with distinct 
gRNAs, although this configuration may 
present problems concerning the alloca- 
tion of a fixed resource (the dCas9 protein) 
among multiple gRNAs targeting different 
promoters. 

In order to facilitate the design of 
sTFs based on zinc fingers, TALEs, and 
CRISPR, a recent study presented a set 
of formal rules for building functional 
sTFs based on 11 possible architectures 
[35]. Besides considering the sTF archi- 
tecture, the sTF must also be designed 
to target a DNA motif that is not ex- 
pected to cause unwanted crosstalk with 
the host cell’s native genes. To this end, 
Lu and colleagues [36] identified 9-, 12-, 
and 15-bp DNA motifs that are underrep- 
resented in or absent from the genomes 
of six model microorganisms, including 
E. coli and S. cerevisiae. This data set can 
aid in the design of orthogonal promoters 
for synthetic gene circuits. Another study 
identified over 180 DNA sequences, each 
of 20 bp, that differed by at least 3 bp from 
all possible 20-mers in annotated human 
promoter regions [34]. Altogether, recent 
advances in orthogonal sTFs present a re- 
source of great value for building synthetic 
regulatory circuits. 
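
The mismatch criterion described above (a candidate site must differ by at least a given number of base pairs from every same-length window of a reference sequence) can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the cited studies: the function names and the toy sequences are hypothetical, only one strand is scanned, and the reference is assumed to be at least as long as the candidate.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def min_distance_to_reference(candidate: str, reference: str) -> int:
    """Smallest Hamming distance between the candidate and any same-length
    window of the reference (forward strand only)."""
    k = len(candidate)
    windows = (reference[i:i + k] for i in range(len(reference) - k + 1))
    return min(hamming(candidate, w) for w in windows)


def is_orthogonal(candidate: str, reference: str, min_mismatches: int = 3) -> bool:
    """True if every reference window differs from the candidate by at least
    `min_mismatches` positions."""
    return min_distance_to_reference(candidate, reference) >= min_mismatches


# Toy data (hypothetical): a short "promoter" and two 6-bp candidate sites.
reference = "ATGCGTACGTTAGC"
print(is_orthogonal("CCCCCC", reference))   # far from every window
print(is_orthogonal("GCGTAC", reference))   # occurs exactly in the reference
```

A practical screen would also scan the reverse complement and use an indexed search rather than this quadratic scan.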


2.5 
Post-Transcriptional Regulation: 
RNA-Based Circuit Engineering 


While most synthetic circuits rely on tran- 
scriptional regulation, RNA-based regula- 
tory devices are also common, due to the 
many attractive features of RNA regulatory 
molecules (for reviews, see Refs [87, 88]). 
First, the rules for rational construction of 
RNA devices with desired function are rel- 
atively well understood, enabling the con- 
struction of libraries of orthogonal RNA 
components. Rules for the rational devel- 
opment of synthetic RNA devices with de- 
sired properties have been described, such 
as the theoretical framework for the de- 
velopment of aptazymes [89] (see below). 
RNA devices are also tunable, as point mu- 
tations in the RNA molecule affect its fold- 
ing and/or strength of interaction with its 
partner RNA or DNA molecule, and hence 
its activity. Second, unlike TFs, regulatory 
RNA molecules do not require translation 
in order to function. Because of these ad- 
vantages, synthetic RNA-based regulation 
is an expanding and exciting field. Some 
examples of synthetic RNA-based regula- 
tion are presented below. 

Gene expression in mammalian cells 
can be knocked down by RNA interference 
(RNAi), in which complementary RNA sequences 
target the mRNA of interest for degradation 
in a sequence-specific manner [90]. 
RNAi molecules, such as short interfering 
RNAs (siRNAs) or microRNAs (miRNAs), 
can be chemically transfected into the cells 
or expressed from designated vectors to 
knock down a specific gene. RNAi may be 
used to build complex regulatory devices 
through the combinatorial regulation of 
a common output transcript by multiple 
miRNAs. Benenson and colleagues used 
this property of miRNAs to build complex 
logic circuits in mammalian cells [23, 91, 
92] (see Sect. 5 for details). In addition, in 
mammalian systems, specific 5’ intronic 
sequences and 3’ polyadenylation sites can 
be added to increase RNA stability and 
transgene translation [93]. Conversely, an 
mRNA can be destabilized with 3’ degra- 
dation tags, such as AU-rich elements 
(AREs) [94] to reduce transgene levels and 
decrease promoter leakiness. 

Synthetic RNA-based devices also in- 
clude small, trans-acting RNA molecules 
that base-pair with their target transcripts 
to either allow or inhibit transcriptional 
elongation or translation. Examples of 
such systems in E. coli include the riboreg- 
ulator [95], the pT181 attenuator [96], and 
the RNA-IN-RNA-OUT system [97]. 

The riboregulator consists of two com- 
ponents: a cis-repressed RNA (crRNA) 
encoding the protein of interest, and a cog- 
nate small trans-activating RNA (taRNA) 
[95]. In the absence of the taRNA, the 5’ 
untranslated region (UTR) of the crRNA 
forms a stem loop that blocks the RBS 
and prevents translation. When the taRNA 
base-pairs with its partner crRNA, the 
stem loop unfolds, allowing protein trans- 
lation [95]. Conversely, the pT181 atten- 
uator allows target gene expression only 
in the absence of the trans-acting RNA: 
the 5’ UTR of the target gene is engi- 
neered with an attenuator loop. In the 
presence of its partner antisense RNA, 
the attenuator loop folds into a hairpin 
that exposes a terminator site, prevent- 
ing transcription of the downstream target 
gene [96]. In the absence of the antisense 
RNA, the target gene is transcribed [96]. 


While the riboregulator controls transla- 
tion, and the attenuator regulates tran- 
scription, the RNA-IN-RNA-OUT system 
couples translational control to transcrip- 
tional elongation of a target gene [97]. 
The system consists of pairs of RNA 
molecules: an RNA-IN molecule and its 
cognate RNA-OUT molecule. The RNA-IN 
element is placed upstream of a sequence 
encoding a small regulatory peptide, tnaC, 
followed by the coding sequence of the 
target gene. RNA-OUT forms a complex 
with RNA-IN that blocks the translation 
of tnaC. In this system, the translation 
of tnaC is necessary to enable transcrip- 
tional elongation of the downstream target 
gene; hence, the presence of RNA-OUT 
blocks transcription of the gene coupled to 
RNA-IN [97]. 
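
The input-output logic of the three trans-acting RNA devices described above can be summarized as an idealized truth table. The sketch below is a Boolean abstraction only (the function names are hypothetical, and real devices are graded and leaky rather than strictly ON or OFF):

```python
def riboregulator(ta_rna_present: bool) -> bool:
    """Translation proceeds only when the taRNA opens the crRNA stem loop."""
    return ta_rna_present


def pt181_attenuator(antisense_present: bool) -> bool:
    """Transcription proceeds only when the antisense RNA is absent."""
    return not antisense_present


def rna_in_out(rna_out_present: bool) -> bool:
    """RNA-OUT blocks tnaC translation and hence downstream transcription."""
    return not rna_out_present


print("trans-RNA  riboregulator  pT181  RNA-IN/RNA-OUT")
for present in (False, True):
    print(present, riboregulator(present), pt181_attenuator(present), rna_in_out(present))
```

In this abstraction the riboregulator behaves as a buffer, whereas the attenuator and the RNA-IN-RNA-OUT pair behave as inverters.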

Another widely used class of regu- 
latory RNA molecules is the aptamer, 
an RNA molecule that specifically binds 
a ligand such as a small molecule or 
protein [40]. Aptamers may be used to 
build riboswitches, which regulate target 
gene expression in response to ligand 
binding. Naturally occurring riboswitches 
are found in bacterial RNAs encoding 
metabolic enzymes, where they regulate 
the enzyme’s expression in response to 
the levels of the corresponding metabo- 
lite [98]. In synthetic circuits, riboswitches 
inserted into the 5’ UTR of a target 
gene modulate target gene expression by 
either permitting or blocking transcrip- 
tional elongation or translation initiation 
of the riboswitch-coupled transcript in 
response to a specific ligand [99, 100] 
(Fig. 1b; for a review, see Ref. [40]). Be- 
cause of their usefulness, many studies 
have focused on screening for aptamers 
and riboswitches with desired activity. Ap- 
tamers that bind a ligand of interest may 
be selected in vitro from large pools of 
random nucleotides [101, 102]. Gallivan 
and colleagues screened large libraries of 
riboswitches by coupling riboswitch activ- 
ity to an easily observable readout, such 
as bacterial cell motility or fluorescence 
[103, 104]. Examples of ligands recog- 
nized by synthetic riboswitches include the 
small molecule theophylline [105], antibi- 
otics [106], and dyes [107]. As an example 
of how aptamers can be used to con- 
trol cell behavior, one study coupled the 
theophylline-responsive aptamer to the ex- 
pression of CheZ, a protein necessary for 
chemotaxis in E. coli, resulting in E. coli 
cells that migrated toward a source of theo- 
phylline [108]. 

Aptamers may also be linked with cat- 
alytically active RNA molecules called ri- 
bozymes that can cleave or splice RNA. 
Such molecules, known as aptazymes, 
carry out RNA-modifying reactions in a 
ligand-responsive manner [109, 110]. In 
one example, the small molecule theo- 
phylline binds to its target aptamer and 
activates a ribozyme that cleaves the 
target mRNA, resulting in a downreg- 
ulation of target gene expression in re- 
sponse to theophylline in S. cerevisiae cells 
[110]. A similar system was later imple- 
mented in mammalian cells to achieve 
theophylline-mediated mRNA degrada- 
tion [111, 112]. Together, riboswitches, 
aptazymes, and trans-regulatory noncod- 
ing RNAs present an expansive toolbox for 
engineering fine-tuned gene expression in 
synthetic circuits. 


2.6 
Insulator Elements 


In order for a synthetic circuit to function 
correctly, its component elements must 
be protected from unwanted interactions 
with other circuit components and with 
the host cell. Therefore, synthetic circuits 
may include insulators: DNA or RNA 
elements that insulate one part of a 
circuit from interference by an adjacent 
synthetic part or by the host’s genome. For 
example, insulator sequences flanking a 
genetic circuit inserted into a mammalian 
chromosome help to prevent the circuit 
from becoming silenced due to local 
chromatin effects [113]. 

In another example, two RNA regu- 
latory elements placed on one mRNA 
molecule may base-pair with each other 
and disrupt each other’s function. To 
avoid this interference, the RNA regu- 
latory elements may be separated physi- 
cally through ribozyme-mediated cleavage 
[87]. A recently described alternative to ri- 
bozymes is the Csy4 endonuclease from 
the CRISPR system [114] (see Sect. 2.4), 
which cleaves RNAs at a 28-nucleotide 
recognition sequence that does not occur 
naturally in E. coli. Csy4 can efficiently 
cleave mRNAs engineered with the recog- 
nition sequence in E. coli and S. cerevisiae, 
and Csy4-mediated separation of tran- 
script components such as the 5’ UTR 
and the coding region allowed expression 
levels to be protected from context effects 
in E. coli [114]. 


2.7 
Post-Translational Regulation and 
Protein-Based Circuits 


The behavior of transcriptional regulatory 
circuits depends in part on the stabil- 
ity of the proteins encoded by the cir- 
cuit genes. Some applications, such as 
the oscillator (Sect. 3.1), require rapid 
degradation of the component TFs. In eu- 
karyotes, protein lifetime can be tuned 
by fusing the protein to a tag that tar- 
gets it for ubiquitin-dependent degrada- 
tion [115]. In E. coli, the 11-amino acid ssrA 
tag targets proteins for rapid degradation 
by the ClpXP protease [116]. To achieve 
inducible, fast degradation of synthetic 
proteins, Hasty and colleagues engineered 
an S. cerevisiae strain for expression of 
ClpXP from an IPTG-inducible promoter 
[41] (Fig. 1c). The addition of IPTG stimu- 
lates the production of ClpXP, which leads 
to a specific degradation of ssrA-tagged tar- 
get proteins. The amount of ClpXP, and 
hence the degradation rate, can be tuned 
by the amount of IPTG added [41]. More- 
over, specific tags have been developed 
that enable inducible protein degradation, 
in which protein degradation is inhibited 
by small molecules [117]. 
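
The dependence of the degradation rate on inducer level can be illustrated with a one-variable steady-state model. This is a sketch under assumptions that are not taken from the cited study: ClpXP abundance is modeled as a Hill function of IPTG, degradation is taken to be proportional to ClpXP, and all parameter names and values are hypothetical.

```python
def clpxp_level(iptg: float, k_half: float = 1.0, hill_n: float = 2.0) -> float:
    """Relative ClpXP abundance (0 to 1) as a Hill function of inducer concentration."""
    return iptg ** hill_n / (k_half ** hill_n + iptg ** hill_n)


def steady_state_protein(iptg: float, synthesis: float = 10.0,
                         dilution: float = 0.1, max_degradation: float = 1.0) -> float:
    """Steady state of dP/dt = synthesis - (dilution + max_degradation * ClpXP) * P."""
    return synthesis / (dilution + max_degradation * clpxp_level(iptg))


for iptg in (0.0, 0.5, 1.0, 5.0):
    print(f"IPTG = {iptg:4.1f}  ->  tagged protein = {steady_state_protein(iptg):6.2f}")
```

In this toy model, more inducer means more protease and therefore a lower steady-state level of the tagged protein.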

In addition, protein-based regulatory 
cascades may be designed and mod- 
eled after intracellular signal transduc- 
tion pathways found in eukaryotic cells 
[118]. Protein-based signaling pathways 
have several advantages over transcription- 
based circuits. Protein-based signaling 
occurs at much faster timescales than 
transcription; moreover, protein-based sig- 
nal transduction relies on processes (such 
as protein phosphorylation or conforma- 
tional change) that require little of the cell’s 
energy and resources. However, designing 
protein-based synthetic devices presents 
unique challenges, mainly because it is 
currently extremely difficult to predict a 
protein’s function and activity from its 
amino acid sequence. Instead of ratio- 
nally designing proteins de novo, the con- 
struction of synthetic post-translational 
regulation relies on repurposing cat- 
alytic and regulatory domains from nat- 
urally occurring signaling proteins, and 
wiring them together in novel ways. Key 
examples include rewiring the pheromone 
signaling response in S. cerevisiae [42] 
(Fig. 1d), as well as using bacterial pro- 
teins to attenuate signaling pathways in 
yeast and in human T cells, with possi- 
ble applications for T-cell-based therapy 
[119]. 


3 
Dynamical Circuits 


The building blocks described above have 
been used to construct synthetic gene cir- 
cuits of varying complexities and purposes. 
The first synthetic gene circuits that were 
constructed were not intended for practi- 
cal applications, but were simple dynam- 
ical networks intended to demonstrate 
the proof-of-principle ability to engineer 
gene circuits with desired behaviors. Early 
examples of dynamical circuits were de- 
scribed in 2000 [8, 9], and have formed the 
basis (either conceptually or practically) 
for much work on synthetic circuits since 
then, whether applied or not. These two 
circuits were the oscillator and the switch. 


3.1 
Oscillators 


Oscillators are common components in 
electronic devices [120]. An oscillator sta- 
bly interchanges between two states and 
can be characterized by its amplitude and 
period. The first synthetic gene oscillator 
circuit was the E. coli “repressilator” [8], 
a ring of three transcriptional regulators, 
LacI, TetR, and CI, each repressing the 
next one in the ring (Fig. 2a). As LacI represses 
the transcription of tetR, it relieves 
the repression by TetR on the cI gene, allowing 
CI levels to rise and repress lacI; 
the now de-repressed tetR gene product 
begins to accumulate until it can repress 
cI, and the cycle continues. Studies on the 
repressilator sparked almost a decade of fo- 
cus on synthetic gene oscillators (see Ref. 
[121] for an extensive analysis). Studies on 
oscillators illustrate the importance of cir- 
cuit topology and parameters on overall 
circuit function, and provide an example 
of the “design-build-test” cycle of synthetic 
circuit engineering. Following the 
initial repressilator, a number of improved 
versions were designed and implemented. 

Fig. 2. Synthetic gene oscillators. (a) 
The repressilator consists of three 
genes in a ring, each repressing 
transcription from the next one in 
the chain [8]. Gene and promoter 
names are given; (b) The robust oscillator 
constructed by Stricker et al. 
uses a combination of positive and 
negative feedback [122]. In both circuits 
the arrows represent transcriptional 
activation, while flat-ended 
arrows represent transcriptional repression. 
Modified from Ref. [122]. 
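
The ring topology of the repressilator can be explored numerically. The sketch below integrates a commonly used dimensionless form of a three-gene repressor ring with the forward Euler method. The parameter values (`alpha`, `alpha0`, `beta`, `n`), the initial conditions, and the step size are illustrative assumptions chosen to give sustained oscillations, and they are not fitted to the experimental circuit.

```python
def simulate_repressilator(alpha=216.0, alpha0=0.216, beta=1.0, n=2.0,
                           dt=0.01, t_end=300.0):
    """Integrate a three-gene ring of repressors with the forward Euler method.

    m[i] is the mRNA and p[i] the protein of gene i; gene i is repressed by
    the protein of gene (i - 1) % 3. Time is in units of the mRNA lifetime.
    """
    m = [0.0, 0.0, 0.0]
    p = [1.0, 2.0, 3.0]   # asymmetric start, away from the symmetric fixed point
    trace = []            # protein level of gene 0 over time
    for _ in range(int(t_end / dt)):
        dm = [-m[i] + alpha / (1.0 + p[(i - 1) % 3] ** n) + alpha0 for i in range(3)]
        dp = [-beta * (p[i] - m[i]) for i in range(3)]
        m = [m[i] + dt * dm[i] for i in range(3)]
        p = [p[i] + dt * dp[i] for i in range(3)]
        trace.append(p[0])
    return trace


def count_peaks(series):
    """Count local maxima in a sampled trajectory."""
    return sum(1 for i in range(1, len(series) - 1)
               if series[i - 1] < series[i] >= series[i + 1])


trace = simulate_repressilator()
late = trace[len(trace) // 2:]   # discard the initial transient
print("peaks in second half:", count_peaks(late))
print("protein range in second half:", min(late), max(late))
```

Lowering `alpha` or `n` in this toy model moves the ring toward a stable steady state, which is one way to see why circuit parameters matter as much as topology.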

Despite being a significant break- 
through, the repressilator was not a 
robust circuit as it only functioned in 
about 40% of cells. It was not until 2008 
that a highly robust oscillator circuit was 
constructed [122]; this oscillator had a 
different topology from the repressilator, 
comprising one gene that activates itself 
and another gene, this second gene 
feeding back to and repressing the first 
gene (Fig. 2b). This design is known as an 
amplified negative feedback oscillator, where 
“amplified” refers to the first gene’s 
positive feedback on itself. A number of 
other oscillators have been constructed, 
the majority within the bacterium E. 
coli [123, 124], and one in Salmonella 
typhimurium [125]. For example, one study 
implemented a metabolic oscillator using 
fluctuations in metabolite pools [123]. 

Oscillators have also been successfully 
introduced into mammalian cells. In a 
negative feedback loop-based oscillator, an 
intron placed upstream of a repressor pro- 
tein coding region increases oscillatory 
period length with the size of the in- 
tron, due to increased transcript length 
[126]. In another example, a tunable os- 
cillator was constructed by combining 
a feedforward loop and a time-delayed 
negative feedback loop based on an 
autoregulated sense-antisense transcription 
control [127]. A subsequent study 
produced a low-frequency oscillator, in 
which a time-delayed negative feedback 
loop was based on a short hairpin RNA 
(shRNA) encoded in the intron of a 
self-regulated transactivator [128]. Finally, 
synthetic-natural hybrid oscillators can be 
constructed in human cells based on the 
structure of natural networks, such as the 
p53 pathway [129]. 

Now that oscillators can be constructed 
routinely, the challenge is to integrate 
them into larger synthetic circuits. A po- 
tential application is to use oscillators 
as timer circuits, keeping time for the 
rest of a synthetic circuit. Oscillators may 
also be used in frequency multiplier cir- 
cuits, allowing different parts of a larger 
circuit to be kept at different timings 
[130]. Attempts to connect oscillators to 
downstream circuits have highlighted a 
common issue that is known to elec- 
trical engineering when connecting up 
different circuits: retroactivity, which oc- 
curs when a downstream component of 
a circuit affects its upstream components 
[131, 132]. In genetic oscillators, retroac- 
tivity occurs because some of the protein 
used in the oscillator itself has to be used 
to drive the downstream circuit by binding 
to the downstream gene promoters. This 
binding sequesters the oscillator protein, 
affecting the oscillator dynamics. One ap- 
proach to minimizing retroactivity is to 
use an amplifier circuit as an intermediate 
between the oscillator and the downstream 
circuit. The amplifier requires only a small 
amount of protein from the oscillator, min- 
imizing the effect on the oscillator, but is 
strongly expressed, giving a large output 
sufficient to drive the downstream cir- 
cuit. This approach has been successfully 
demonstrated in an in vitro synthetic oscil- 
lator based on hybridization of synthetic 
DNA and RNA oligonucleotides (see Ref. 
[133] for details). 


3.2 
Toggle Switches 


Along with the oscillator, another example 
of a simple functioning device is a toggle 
switch: a circuit that can switch between 
two stable states, ON and OFF, similar to 
the light switch of a lamp. As a result, 
a switch possesses a form of memory, 
which is vital for cellular computation (see 
Sect. 4). The genetic switch, commonly 
called a toggle switch, consists of two tran- 
scriptional repressors, each repressing the 
other (Fig. 3). Collins and colleagues de- 
scribed a functional toggle switch in a 
living E. coli cell [9]; the switch uses two 
small molecule-responsive transcriptional 
repressors: TetR, which binds aTc, and 
LacI, which binds IPTG (see Sect. 2.2). The 
addition of aTc inactivates TetR, switching 
the system into a state with high expression 
of LacI, which then maintains a low 
concentration of TetR. Conversely, addition 
of IPTG inactivates LacI, switching 
the system into a high-TetR state, which 
then maintains a low concentration of 
LacI. Another version of the toggle switch 
uses a heat pulse as a means of switching 
off a temperature-sensitive transcription 
repressor [9]. 

Fig. 3 The toggle switch. The original 
synthetic toggle switch in E. coli 
consists of two repressors, lacI and 
tetR, which repress each other’s transcription 
[9]. The small molecules 
aTc and IPTG bind to TetR and LacI 
proteins, respectively, preventing 
them from repressing transcription 
and therefore allowing expression 
of lacI and tetR, respectively. 
Flat-headed arrows depict transcriptional 
repression. Adapted from 
Ref. [9]. 
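
The memory property of mutual repression can be illustrated with a two-variable model. The sketch below is a toy model under stated assumptions: `u` and `v` stand for dimensionless LacI and TetR levels, aTc is assumed to reduce the active fraction of TetR by a factor `1 + atc`, and all parameter values are hypothetical. A transient pulse of inducer flips the state, and the new state persists after the inducer is removed.

```python
def simulate_toggle(u0, v0, atc_pulse_end=0.0, alpha=10.0, n=2.0,
                    atc_level=100.0, dt=0.01, t_end=50.0):
    """Forward-Euler integration of two mutually repressing genes.

    u stands for LacI and v for TetR (dimensionless levels). While
    t < atc_pulse_end, aTc is present and lowers the active fraction of TetR.
    """
    u, v = u0, v0
    for step in range(int(t_end / dt)):
        t = step * dt
        atc = atc_level if t < atc_pulse_end else 0.0
        active_v = v / (1.0 + atc)              # TetR not bound by aTc
        du = alpha / (1.0 + active_v ** n) - u  # LacI is repressed by active TetR
        dv = alpha / (1.0 + u ** n) - v         # TetR is repressed by LacI
        u, v = u + dt * du, v + dt * dv
    return u, v


# Both runs start in the low-LacI / high-TetR state.
print("no inducer:    (LacI, TetR) =", simulate_toggle(0.1, 9.9))
print("transient aTc: (LacI, TetR) =", simulate_toggle(0.1, 9.9, atc_pulse_end=10.0))
```

Making the two repressors unequal in strength (for example, by giving each gene its own `alpha`) is the kind of change that can remove one of the two stable states.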

A mammalian toggle switch was also 
developed based on two transcriptional re- 
pressors, enabling it to switch between 
states according to two antibiotic inputs 
[134]. Subsequently, an RNAi-based toggle 
switch was designed for the tight regulation 
of gene expression in mammalian 
cells; it switches between states according 
to small-molecule inputs [135]. The 
switch could be used to tightly control the 
expression of potentially useful proteins, 
such as the toxic alpha chain of diphtheria 
toxin and the pro-apoptotic protein BAX 
(BCL-2-associated X protein) that can kill 
cancer cells. Recently, a light-responsive 
toggle switch was developed, in which 
gene expression is induced by red light 
and silenced by far-red light [136]. 

As is the case with the oscillator, the 
function of the toggle switch depends on 
the relative expression levels of its compo- 
nent transcription repressors — the mutual 
repression between the two genes must be 
approximately balanced. By unbalancing 
the relative levels of the two repressors, 
Ellis et al. were able to convert the E. 
coli toggle switch into a genetic timer 
[137]. In this way, the authors changed 
the dynamics of the switch from bistable 
to monostable, so that the circuit would re- 
turn to its initial state after perturbation. In 
this implementation, the toggle switch was 
set up to return to a stable state with low 
LacI expression. Induction with aTc moved 
the system into a high-LacI, low-TetR state. 
Upon the removal of aTc, the system even- 
tually returned to low LacI and high TetR 
conditions. The time needed to reset the 
system confers the circuit’s timer action, 
which can be controlled by changing the 
strengths of the various promoters [137]. 

A further advance in switches was made 
with the construction of a push-on-off 
switch [138]. Whereas the toggle switch requires 
two inputs, the push-on-off switch 
switches repeatedly back and forth be- 
tween two states using just one input. 
This circuit was larger and more complex 
than the original toggle switch (see Ref. 
[138] for details), and was not particularly 
robust; the fraction of cells switching de- 
creased quickly as the number of rounds 
of switching increased, in part because the 
input used — UV light — caused cell lethal- 
ity. The authors proposed ways to improve 
future implementations of the circuit, in- 
cluding using an input that would be less 
harmful to the host cell [138]. 

Like oscillators, toggle switches can also 
be coupled to cell behaviors of interest. 
For example, Kobayashi et al. built a toggle 
switch circuit in E. coli to induce biofilm 
formation in response to DNA damage 
[139]. The toggle switch consisted of the 
CI and LacI TFs mutually repressing each 
other. This circuit starts out in a low-LacI, 
high-CI state. A pulse of DNA-damaging 
radiation then triggers expression of the 
native RecA protease, which cleaves CI 
and flips the toggle switch into a low-CI, 
high-LacI state, which in turn promotes 
expression of a gene required for biofilm 
formation [139]. 

In a recent study, a synthetic toggle 
switch was used to address a fundamental 
question in cell biology: How can cells 
in an isogenic population achieve distinct 
cell fates when starting from the same 
state [140]? The authors implemented a 
toggle switch in S. cerevisiae that can 
exist in two states: the high-LacI/low-TetR 
state, which expresses green fluorescent 
protein (GFP) but not mCherry; and the 
low-LacI/high-TetR state, which expresses 
mCherry but not GFP. Addition of the 
TetR inhibitor aTc switches the cells into 
the high-LacI, GFP-positive state. The 
system was “initialized” by growing the 
cells in glucose, which blocked expression 
of all synthetic proteins. The cells were 
then transferred to galactose media to 
permit protein expression, and their fate 
(high versus low GFP) was observed in 
the presence of varying amounts of aTc. 
In the presence of an intermediate aTc 
concentration, the isogenic cell population 
eventually reached a bimodal distribution, 
with approximately equal numbers of cells 
becoming GFP-positive and GFP-negative 
[140]. In contrast, when given a high or low 
dose of aTc, all cells reached a high-GFP 
or low-GFP state, respectively. Moreover, 
the authors showed that the intermediate 
aTc concentration that leads to a bimodal 
distribution of cell fates depends on 
the expression levels of LacI and TetR: 
for example, a cell line engineered to 
express a lower amount of TetR relative 
to LacI will need less aTc to reach a 
bimodal distribution. The experimental 
evidence and accompanying mathematical 
model indicate that isogenic cells can 
stochastically reach distinct cell fates when 
exposed to a signal whose concentration 
is in a critical intermediate range, which 
depends in turn on the expression levels 
of the cell’s regulators that respond to 
the signal [140]. In summary, the authors 
used a simple synthetic circuit in order 
to gain insight into a complex biological 
phenomenon. 


3.3 
Gene Cascades 


In addition to the repressilator and tog- 
gle switch, synthetic biologists have built 
genetic cascades: systems in which an 
upstream gene product regulates a down- 
stream gene, which in turn regulates a 
gene further downstream, and so on. 
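
A cascade of this kind can be sketched as a series of steady-state transfer functions. The toy model below assumes identical repression stages described by Hill functions with hypothetical parameters (`top`, `n`); it is not the model or the parts used in the experimental work. It shows how the steepest log-log slope of the input-output curve can grow as stages are added.

```python
import math


def stage(x, top=10.0, n=2.0):
    """Steady-state output of one repression stage driven by input level x."""
    return top / (1.0 + x ** n)


def cascade(x, length):
    """Output of `length` repression stages connected in series."""
    for _ in range(length):
        x = stage(x)
    return x


def max_log_sensitivity(length, h=1e-4):
    """Largest |d ln(output) / d ln(input)| over inputs from 0.01 to 100."""
    best = 0.0
    for k in range(401):
        x = 10.0 ** (-2.0 + 0.01 * k)
        hi, lo = cascade(x * (1 + h), length), cascade(x * (1 - h), length)
        slope = (math.log(hi) - math.log(lo)) / (math.log(1 + h) - math.log(1 - h))
        best = max(best, abs(slope))
    return best


for length in (1, 2, 3):
    print(length, "stage(s): steepest log-log slope =", round(max_log_sensitivity(length), 2))
```

The sharpening relies on each stage operating with a gain above one near its switching point; weaker stages would attenuate rather than sharpen the response.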
Weiss and colleagues [141] constructed 
synthetic cascades of various lengths in E. 
coli (Fig. 4) in order to understand how the 
characteristics of the cascade relate to its 
length (i.e., the number of genes in the cas- 
cade). Longer cascades were shown to have 
sharper switching between ON and OFF 
states, and to display noisier dynamics. 
As predicted, the response time increased 
with the length of the cascade. This feature 
allowed the longer cascade to act as a 
low-pass filter, with the output less affected 
by rapid fluctuations in the input than 
by fluctuations persisting for a longer time. 
However, this filtering ability came at the 
expense of synchrony: the longer the cascade, 
the more variability was observed in 
response time across the cell population. 
The authors pointed out that genetic cascades 
are common in Nature, but that if the 
cascade regulates a process that requires 
the cells to act in concert (such as development 
of a multicellular organism), then 
additional regulatory mechanisms must 
exist to ensure synchrony [141]. 

Fig. 4 The properties of synthetic gene cascades 
vary with cascade length. Cascades of 
two, three, and four genes were constructed 
[141]. aTc binds to TetR protein, preventing it 
from repressing transcription from a downstream 
promoter. The graphs on the right 
show the steady-state relationship between 
the concentration of aTc (input) and yellow 
fluorescent protein (YFP) fluorescence (output). 
As the length of the cascade increases, 
this relationship becomes more sensitive, with 
sharper switching between ON and OFF states 
[141]. Modified from Ref. [141]. 


3.4 
Bandpass Filters 


In electronics, bandpass filters are made to 
block the transmission of any wavelength 
below or above a specific range, thus allowing 
only a specific range of wavelengths 
with a defined minimum and maximum 
to pass [142]. In biology, a bandpass filter 
can be used to limit the response of a gene 
network to predefined biological input lev- 
els, ignoring inputs that are too weak or 
too strong. 

A tunable bandpass filter was imple- 
mented in E. coli using a feedforward 
loop in which a β-lactamase gene (bla)
and a tetracycline resistance gene (tetC) 
are linked by mutually repressive inter- 
actions [32]. The Bla enzyme allows the 
bacteria to survive in ampicillin (Amp) 
by hydrolyzing Amp. A minimum thresh- 
old level of Bla is required for the cell 
to survive in the presence of Amp, but 
if Bla accumulates above the maximum 
threshold level, this leads to a repression 
of the tetC gene and hence makes the 
cell sensitive to tetracycline (Tet). As a re- 
sult, the circuit acts as a bandpass filter 
when the cells are grown in the presence 
of both Amp and Tet: only cells with a 
specific range of circuit activity, assayed 
by a GFP reporter, will survive. An inter- 
esting feature of the circuit is the control 
of the bla gene by an IPTG-inducible pro- 
moter, which makes the circuit tunable: 
varying the IPTG concentration in the me- 
dia changes the properties of the bandpass 
filter (the concentration of Amp and Tet 
that allows the cell to survive) [32]. A band- 
pass filter was also built in mammalian 
cells. Through a combination of inter- 
connected transcriptional activators and 
repressors, the cells were programmed to 
express a reporter gene only in the pres- 
ence of intermediate input concentrations 
[143]. 
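
The survival window created by the double selection can be sketched numerically. The model below is a simplification under stated assumptions: Bla is taken to be proportional to circuit activity, TetC is taken to follow a Hill-type repression curve, and the function names, thresholds, and parameter values are hypothetical rather than measurements from Ref. [32].

```python
def bla_level(activity):
    """Bla rises with circuit activity (assumed proportional)."""
    return activity


def tetc_level(activity, K=10.0, n=2.0):
    """TetC falls as circuit activity rises (assumed Hill-type repression)."""
    return 1.0 / (1.0 + (activity / K) ** n)


def survives(activity, amp_threshold=1.0, tet_threshold=0.5):
    """A cell survives Amp plus Tet only if both resistance proteins suffice."""
    enough_bla = bla_level(activity) >= amp_threshold
    enough_tetc = tetc_level(activity) >= tet_threshold
    return enough_bla and enough_tetc


activities = [0.1, 0.5, 1, 2, 5, 10, 20, 50]
survivors = [a for a in activities if survives(a)]
print(survivors)
```

Only the intermediate activities pass both selections. Raising `amp_threshold` or `tet_threshold` (the analog of changing the antibiotic concentrations) moves the lower and upper edges of the band, respectively.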


4 
Memory Devices 


Many synthetic circuits respond to specific 
inputs quickly and reversibly: the cell 
performs a desired action in the presence
of input such as a small molecule or a 
specific wavelength of light, and then, 
once the input is withdrawn, the cell 
returns to its “ground” state. The kinetics
of the circuit's response to the input 
are important for its function, and many 
applications require the circuit to return to 
the pre-input state promptly. 

In contrast, some applications require 
the cell to maintain memory of a transient 
stimulus, even over long time periods and 
multiple generations. One potential use of 
a memory circuit is to record the cell’s his- 
tory of intrinsic or extrinsic events, such 
as exposure to toxic chemicals or DNA 
damage, that are relevant in environmen- 
tal remediation and medical diagnostics. 
A counter enables the cell to remember 
not only prior exposure to a stimulus, but 
also the number of times it was exposed. 
Memory circuits may enable engineered 
microbes to respond to complex and un- 
certain environments, based not only on 
current input but also on memory of past 
events; this ability would allow more com- 
plex cell responses than those permitted by 
simple digital logic (see Sect. 5). Potential 
uses include bioremediation, support for 
crops growing in marginal soil, or dis- 
ease treatment [144]. Moreover, during 
the development of a complex multicel- 
lular organism, each cell must at some 
point commit to a specific cell fate, and 
its progeny must maintain memory of 
the cell fate choice. Synthetic memory 
devices may mimic the complex cellular 
logic implemented during differentiation; 
for example, a memory device may pro- 
duce distinct outputs depending on the 
sequence of inputs (A then B versus B 
then A). These memory circuits may be 
useful in studying or programming cellu- 
lar differentiation for applications such as 
tissue engineering. 
Synthetic memory can be divided into
volatile and nonvolatile [145]. Volatile 
memory requires the cell to maintain 
the memory of the past event actively, 
whereas nonvolatile memory is passively 
maintained by the cell following the initial 
stimulus. These memory systems can be 
analogized to dynamic and static memory 
in electronic systems, respectively. Many 
of the synthetic memory devices, such 
as the autoregulatory feedback loop de- 
scribed below, resemble those that occur 
naturally in cells and function extensively 
in processes such as cell development 
[146, 147]. In mammalian cells, naturally 
occurring long-term memory also relies 
heavily on the chromatin state, which may 
present another opportunity for imple- 
menting synthetic memory devices [148]. 
Currently, an incomplete knowledge of 
the mechanisms and outcomes of spe- 
cific chromatin modifications has limited 
the ability to design chromatin-based syn- 
thetic memory devices. However, synthetic 
chromatin-modifying factors have been 
described [149], and synthetic zinc finger- 
and TALE-based DNA-binding domains 
have allowed chromatin-modifying pro- 
teins to be targeted to specific loci in living 
cells [39, 80] (Sect. 2.4). A greater under- 
standing of chromatin modifications may 
enable a more sophisticated and reliable 
use of chromatin-based synthetic circuits 
in the future. 


4.1 
Volatile Memory 


A volatile memory device requires active 
maintenance of the memory state by the 
cell (e.g., through transcriptional regula- 
tion). The toggle switch (see Sect. 3.2), 
in which two genes, each responsive to a 
distinct input, repress each other’s expression,
is an example of volatile memory:
the cell remembers the last input it received,
even after the input is withdrawn,
but maintenance of this memory requires 
active gene expression [9, 134]. 

Volatile memory has also been imple- 
mented through autoregulatory feedback 
loops, which consist of two sTFs: a sen- 
sor TF and a self-activating “loop” TF. A 
transient stimulus activates the sensor TF, 
which in turn initiates expression of the 
loop TF. Once the loop TF accumulates 
above a threshold level, it maintains its 
own expression in a positive autoregula- 
tory feedback loop even after the sensor 
TF becomes inactivated [18, 150]. In an 
early example, S. cerevisiae cells were pro- 
grammed to turn on long-term yellow 
fluorescent protein (YFP) expression in re- 
sponse to a transient galactose input [150]. 
In a subsequent study, mammalian cells 
were programmed to remember past ex- 
posure to hypoxia and DNA damage, both 
of which are linked with cancer [18]. 
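
The threshold behavior of the self-activating loop can be illustrated with a small simulation. This is a sketch under assumptions, not the published model: the loop TF is given Hill-type self-activation and first-order decay, the sensor TF is reduced to a constant production term that is active only during the stimulus, and all parameter values are hypothetical.

```python
def simulate_loop(pulse_duration, t_end=50.0, dt=0.01,
                  K=0.4, n=2, v_max=1.0, decay=1.0, sensor_rate=0.5):
    """Return the loop-TF level at t_end after a stimulus of given duration."""
    level = 0.0
    for step in range(int(t_end / dt)):
        t = step * dt
        # The sensor TF drives loop-TF production only while the stimulus lasts.
        sensor = sensor_rate if t < pulse_duration else 0.0
        # Positive autoregulation: the loop TF activates its own promoter.
        feedback = v_max * level ** n / (K ** n + level ** n)
        level += dt * (sensor + feedback - decay * level)  # forward Euler step
    return level


for duration in (0.0, 0.2, 5.0):
    print(duration, round(simulate_loop(duration), 3))
```

With these parameters the system has a stable OFF state at 0, a stable ON state at 0.8, and an unstable threshold at 0.2. A stimulus that is too brief to lift the loop TF above the threshold is forgotten, whereas a longer stimulus is remembered long after it has ended.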

Another circuit design exhibits a differ- 
ent form of memory: the ability to count 
three pulses of a defined small-molecule 
input [151]. In this circuit, expression of 
the final output (GFP) depends on a se- 
ries of orthogonal phage-derived RNAPs 
that are under the control of a riboregu- 
lator (see Sect. 2.5): the RNAP-encoding 
transcripts cannot be translated, due to 
cis-repressive RNA structures present in 
the transcripts. Each pulse of the in- 
ducer triggers the expression of a small 
trans-acting RNA, which relieves the 
RNA-based repression and allows each 
RNAP transcript to be translated. After 
three pulses of the inducer, the cascade is 
complete and the GFP reporter is highly 
expressed [151]. The device requires the 
pulses to be spaced closely together, to 
prevent each RNAP from being degraded 
or diluted by cell division before the next 
RNAP in the cascade can be expressed. 


4.2 
Nonvolatile Memory 


In contrast to volatile memory, nonvolatile 
memory does not require continuous ac- 
tion by the host cell; rather, the memory 
device retains its state passively after the 
initial switching event. Nonvolatile mem- 
ory devices have been implemented using 
site-specific DNA recombinases that can 
insert, excise, and invert DNA fragments at 
specific DNA sequence motifs. Advantages 
of nonvolatile memory devices include a 
lower metabolic burden on the cell, be- 
cause the cell does not expend energy to 
maintain the memory state, as well as 
the ability to construct a multistate circuit 
from relatively few parts by arranging the 
recognition sites for two or more orthog- 
onal DNA recombinases. The number of 
possible states of the system grows expo- 
nentially with the number of orthogonal 
recombinases available [144]. 

Notably, unlike volatile memory, non- 
volatile memory based on DNA recom- 
binases persists after the removal of the 
memory device (the rearranged DNA frag- 
ment) from the host cell. This has led to the 
intriguing idea that such memory devices 
could be shared among cells, via the transformation
of a “naive” cell with DNA from
a lysed memory cell [144], leading in turn 
to the possibility of complex multicellular 
computation. 

Examples of synthetic nonvolatile mem- 
ory include the use of bacterial invertases 
that catalyze inversion of a fragment of 
DNA flanked by specific inverted repeat 
sequences [144, 152]. In one example, in- 
duction of the E. coli FimE invertase leads 
to an irreversible flipping of the promoter 
of the gfp reporter gene from the OFF state 
to the ON state [152]. In a more complex 
follow-up study, the use of two orthogonal 
invertases (Fim from E. coli and Hin from 
Salmonella) allows for the programming
of devices with many possible states, in- 
cluding devices that can produce different 
output depending on the history of the sys- 
tem: Fim followed by Hin, versus Hin then 
Fim (Fig. 5a) [144]. While these studies pre- 
sented an exciting proof-of-principle, the 
resulting circuits did not behave entirely 
as expected, due partly to the very low rate 
of inversion by Fim [144]. 
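
The dependence of the final state on the order of inputs can be reproduced with a toy model of DNA inversion. The layout below, with interleaved “F” and “H” sites, is a hypothetical arrangement chosen for illustration and is not the construct of Ref. [144]; the model also ignores the orientation of the recognition sites and the directionality of the enzymes.

```python
def flip(element):
    """Integers are oriented DNA segments; strings are recognition sites."""
    return -element if isinstance(element, int) else element


def invert(dna, site):
    """Reverse and flip everything between the two copies of `site`."""
    first = dna.index(site)
    second = dna.index(site, first + 1)
    middle = [flip(x) for x in reversed(dna[first + 1:second])]
    return dna[:first + 1] + middle + dna[second:]


# Hypothetical starting arrangement: segments 1-3 separated by F and H sites.
start = ["F", 1, "H", 2, "F", 3, "H"]

fim_then_hin = invert(invert(start, "F"), "H")
hin_then_fim = invert(invert(start, "H"), "F")

print(fim_then_hin)
print(hin_then_fim)
```

Because the two inversions overlap, they do not commute: the same pair of inputs applied in opposite orders leaves the segments in different arrangements, so the DNA records the history of the system and not only which inputs were seen.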

Another early example of a 
recombinase-based synthetic mem- 
ory device is the DNA invertase cascade 
in E. coli that can count to three: reporter 
GFP expression becomes active only after 
three pulses of an inducer molecule [151]. 
In this system, each pulse of the small 
molecule input drives expression of a 
recombinase, which in turn catalyzes a 
DNA inversion that prepares the circuit to 
respond to the next inducer pulse. One 
version of the circuit counts three pulses 
of the same small molecule, arabinose. 
Another version responds to pulses of three
different inducers given in a specified 
order: for example, the circuit expresses 
GFP if aTc, arabinose, and IPTG are 
supplied in that order, but not if the order 
of the stimuli is rearranged [151]. This 
system is useful for modeling processes 
such as development, where the timing 
of the stimuli matters in addition to the 
identities of the stimuli. 
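
The order-sensitive version of the counter behaves like a small state machine, which can be sketched as follows. The function and variable names are hypothetical, and the model abstracts each recombinase-mediated inversion into a single step that advances a stage counter; it is an illustration of the logic, not of the molecular design in Ref. [151].

```python
EXPECTED_ORDER = ["aTc", "arabinose", "IPTG"]


def gfp_after(pulses, expected=EXPECTED_ORDER):
    """Return True if GFP is expressed after the given sequence of pulses."""
    stage = 0  # number of inversions completed so far (stored in the DNA)
    for inducer in pulses:
        # Only the inducer matching the current stage triggers an inversion,
        # which in turn prepares the circuit for the next inducer.
        if stage < len(expected) and inducer == expected[stage]:
            stage += 1
    return stage == len(expected)


print(gfp_after(["aTc", "arabinose", "IPTG"]))
print(gfp_after(["IPTG", "arabinose", "aTc"]))
```

In this abstraction an out-of-order pulse is simply ignored, and the stage reached so far persists between pulses because it is encoded in the orientation of the DNA.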

Other studies in this area have utilized 
phage-derived serine recombinases. Orig- 
inally used in bacteria, codon-optimized 
recombinases also function in eukaryotic 
cells. These enzymes catalyze recombina- 
tion between two recognition sites, termed 
attP and attB (attachment phage and at- 
tachment bacterium, respectively), result- 
ing in the formation of attL and attR sites 
(Fig. 5b). Because the recombinase does 
not recognize attL and attR sites, the reaction
is irreversible unless an excisionase is
added (see below). Depending on the location
and orientation of the starting attB
and attP sites with respect to each other,
the reaction results in excision, integration,
or inversion (“flipping”) of a DNA
fragment (Fig. 5b). Site-specific recombinases
from multiple phages have been
described, including TP901, Bxb1, and
φC31 [154]. Notably, these recombinases
are orthogonal, with each recognizing a
specific and unique pair of attP and attB
sites, a property that allows the use of
multiple recombinases in a single cell.

Fig. 5 Synthetic nonvolatile memory devices.
(a) The bacterial invertase-based memory
device uses two orthogonal enzymes, Fim
and Hin, which mediate DNA inversion between
sites marked “F” and “H,” respectively.
Note that the final state of the system depends
on the order of inputs: Fim followed
by Hin versus Hin followed by Fim [144]; (b)
Phage-derived serine recombinases catalyze
DNA inversion (left) or integration/excision
(right) between specific recognition sites,
termed attB and attP. The reaction produces
attL and attR sites, which are not recognized
by the recombinase, making the reaction
irreversible unless a cognate excisionase is added;
(c) An example of a two-input digital logic
gate implemented with two orthogonal serine
recombinases, Bxb1 and φC31 [153]. In this
AND gate, two transcriptional terminators prevent
expression of the GFP reporter from the
upstream promoter. When both recombinases
are present, the two terminators are switched
to the “off” position, and GFP expression can
proceed. Panel (a) modified from Ref. [144];
panel (c) modified from Ref. [153].

A recent study demonstrated the construction
of all 16 two-input Boolean
logic gates (see Sect. 5) in E. coli using
two orthogonal site-specific recombinases,
Bxb1 and φC31 [153]. In this
study various combinations of promoters, 
transcriptional terminators and reporter 
coding regions flanked by recombinase 
target sites were used to achieve the de- 
sired logic function. For example, AND 
logic was implemented by inserting two 
transcriptional terminators, one flanked 
by Bxb1 sites and the other by φC31 sites,
between a promoter and a GFP-coding 
region, such that the two terminators 
were flipped to the OFF orientation and 
GFP was expressed only in the presence
of both recombinases (Fig. 5c). The expression
of the two recombinases was
coupled to two different small-molecule 
inputs, so that Boolean logic computa- 
tions were performed upon addition of 
the specified input molecules. As with the 
previous examples of recombinase-based 
circuits, the circuits passively maintained 
their state upon removal of the stimulus. 
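
The combination of AND logic with passive memory can be captured in a few lines. The class below is a hypothetical abstraction of the terminator-based gate, in which each recombinase permanently inactivates its own terminator; the class and attribute names are invented for illustration, and the molecular details of Ref. [153] are not modeled.

```python
class RecombinaseAndGate:
    """Two terminators block GFP; each is inactivated by one recombinase."""

    def __init__(self):
        # True means the terminator still blocks transcription.
        self.terminator_blocking = {"Bxb1": True, "phiC31": True}

    def expose(self, recombinases):
        """Apply a set of recombinases; each inversion is treated as one-way."""
        for name in recombinases:
            if name in self.terminator_blocking:
                self.terminator_blocking[name] = False

    @property
    def gfp_on(self):
        return not any(self.terminator_blocking.values())


gate = RecombinaseAndGate()
gate.expose({"Bxb1"})
print(gate.gfp_on)
gate.expose({"phiC31"})
gate.expose(set())  # the inputs are withdrawn
print(gate.gfp_on)
```

Because the state is stored in the DNA, the output remains ON after both inputs have been withdrawn. In this abstraction the two inputs also need not be present at the same time, which is a consequence of the one-way inversion assumed here.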

Although a serine integrase alone catalyzes
an irreversible reaction (attB + attP → attL + attR),
the combination of the
integrase with a cognate excisionase can 
catalyze the reverse reaction and restore 
the attB and attP sites. A recent study 
harnessed this phenomenon to build a re- 
versible memory module in E. coli [155]. In 
this system, expression of the Bxb1 inte- 
grase alone leads to red fluorescent protein 
(RFP) reporter expression; coexpressing 
the Bxb1 integrase with its partner ex- 
cisionase inverts the orientation of the 
reporter gene promoter, resulting in GFP 
expression [155]. In theory, the system can 
“flip” repeatedly between the GFP-on state 
and the RFP-on state through repeated in- 
duction of integrase alone or integrase 
with excisionase. However, functioning of 
the system requires careful adjustment of 
the integrase:excisionase ratio [155]. 


5 
Boolean Logic and Digital Circuits 


Most of the synthetic gene circuits that 
have been implemented so far have ben- 
efited from digital logic design, in which 
cellular networks are thought of as assem- 
blies of digital logic gates. A digital logic 
gate is a device that carries out a Boolean 
operation and produces a Boolean output 
based on the different combinations of in- 
puts it receives. Both the inputs and the 
output of a logic gate are in the form of 
TRUE and FALSE, often represented as
1 and 0, respectively. Examples of basic
Boolean operations include AND (output
is TRUE only if all inputs are TRUE),
NAND (output is TRUE unless all inputs
are TRUE; the negation of AND), OR
(output is TRUE if at least one input is
TRUE), and NOR (output is TRUE if none
of the inputs is TRUE; the negation of
OR) (Fig. 6). Most digital logic gates used
in synthetic biology have been two-input 
gates, of which there are 16 types. Inter- 
connecting such two-input logic gates so 
that the output of some gates acts as in- 
put for others can be used to design more 
complex logic circuits. 
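
The count of 16 two-input gates follows from the fact that a truth table for two inputs has four rows, each of which can be TRUE or FALSE (2^4 = 16). The short sketch below enumerates all such tables and locates the four gates named above; the dictionary and function names are chosen for illustration.

```python
from itertools import product

# The four possible input combinations: (F, F), (F, T), (T, F), (T, T).
INPUTS = list(product([False, True], repeat=2))

GATES = {
    "AND": lambda a, b: a and b,
    "NAND": lambda a, b: not (a and b),
    "OR": lambda a, b: a or b,
    "NOR": lambda a, b: not (a or b),
}


def truth_table(gate):
    """Outputs of a gate for the four input combinations, in INPUTS order."""
    return tuple(gate(a, b) for a, b in INPUTS)


# Every assignment of TRUE/FALSE to the four rows is one two-input gate.
ALL_TABLES = list(product([False, True], repeat=len(INPUTS)))
print(len(ALL_TABLES))  # 16

for name, gate in GATES.items():
    print(name, truth_table(gate))
```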

The domain of digital logic computation 
in living cells is modeled after engineering 
electronic circuits, with the goal of pro- 
gramming cells similarly to programming 
a digital computer, only with biological 
inputs and outputs rather than electronic 
ones. Instead of physical wires, molecules 
such as RNA and protein connect logic 
gates inside the cells. Such circuits could 
be implemented for many purposes, in- 
cluding diagnostics and therapeutics, in 
which the cell is designed to perform a 
specific behavior in response to specific 
stimuli, such as “if the inputs match the 
profile of a cancer cell, kill the cell” [22, 
23, 156-158]. Moreover, since biological 
inputs can be converted to electric ones 
and vice versa [136, 159-161], cells could 
potentially be integrated as biological com- 
putational units for electronic computers. 

Compared to the assembly of their 
silicon-based counterparts, the construc- 
tion of complex biological logic circuits by 
layering elementary logic gates has proven 
to be extremely challenging. Over the past 
decade, many different designs with vari- 
ous functions have been implemented in 
both prokaryotic and eukaryotic cells, but 
the level of complexity of these designs 
has not improved significantly [162]. This
is in part due to unwanted crosstalk between
components of synthetic circuits
(Sect. 2.3), the propagation and amplification
of noise through the circuit (Sect.
8.3), and the metabolic burden that accompanies
the expression of exogenous genes
in host cells (Sect. 8.2). In this section, a
few hallmark examples of synthetic digital
logic circuits that have been implemented
are briefly described.

Fig. 6 Digital logic gates. A few examples of
two-input logic gates as well as the truth table
that each one encodes are shown. (a) OR
gate; (b) NOR gate; (c) AND gate; (d) NAND
gate. A truth table summarizes the outputs to
each possible combination of inputs of a logic
gate. The logic gates are drawn according to
the conventions of electrical engineering. The
bubble in (b) and (d) symbolizes negation;
(e) Schematic representation of a genetic NOR
gate: the presence of either or both of the two
inducers activates the expression of the repressor
protein, which in turn represses the reporter
output. Hence, the output is expressed
only when neither of the inducers is present,
as per the truth table shown in panel (b). In1,
input 1; In2, input 2; Out, output.

An early example of a digital logic circuit
used chimeric promoters that bind
small-molecule-responsive transcriptional
activators and repressors. Depending on
the combination of activators and repressors
used, the system gave rise to different
logic gates, including AND, NAND, and
NIMPLY (output is TRUE only when one
specific input is TRUE and the other is
FALSE) gates [163]. Subsequent studies
employed split inteins
which, when fused to separate proteins,
would mediate fusion of the two proteins
into one. Hence, split inteins would allow
the reconstitution of a functional TF
that can activate or repress an output promoter,
forming an AND or NAND logic
gate, respectively. Split inteins fused to 
sequence-specific ZF-TFs (see Sect. 2.4) 
were used to build a variety of logic gates in 
mammalian cells [70]. Recently, the split 
intein strategy was used to build AND 
gates in mouse stem cells [164], suggest- 
ing the possibility of future applications in 
programming stem cell fate for research 
and therapy. 

Other digital logic circuits combined 
transcriptional and post-transcriptional
regulation. For example, Arkin and col- 
leagues [165] constructed a two-input AND 
gate in E. coli. The first input activates tran- 
scription of a gene encoding T7 RNAP, 
but the protein cannot be translated due 
to the presence of two premature amber 
stop codons. The second input turns on 
expression of an amber suppressor tRNA, 
so that the presence of both inputs allows 
expression of T7 RNAP protein and sub- 
sequent transcription of the reporter gene 
from a T7 promoter [165]. The authors
linked the AND gate to the expression of
Invasin in bacteria, such that the bacte- 
ria invaded human cancer cells in vitro 
only in the presence of two user-defined 
inputs [165, 166]. This circuit represents 
a proof-of-principle strategy that may be 
used for cancer therapy in the future. 
Digital logic circuits may also be im- 
plemented in mammalian cells using 
miRNAs, as shown by Benenson and col- 
leagues, who built computational devices 
in mammalian cells by targeting multi- 
ple miRNAs to the same output transcript 
[91, 92]. Expression of the miRNAs may 
be coupled to the presence or absence 
of user-defined transcriptional activators 
or repressors, which act as inputs to the 
circuit [92]. For example, if a transcript
that encodes a fluorescent protein output
has recognition sequences for two
miRNAs (miR-a, which is repressed by
TF-A, and miR-b, which is activated by
TF-B), then the resulting circuit will produce
fluorescent output only if A is present
and B is absent, implementing the digital
logic function A AND NOT B. More complex
circuits may be built by combining
multiple miRNA-responsive output transcripts
in one cell [91, 92].

Fig. 7 Logic computation in mammalian
cells: the miRNA classifier circuit. The HeLa
cancer cell classifier causes production of
the pro-apoptotic protein hBax and hence
the death of the host cell if, and only if, all
“HeLa-high” miRs are highly expressed and
none of the “HeLa-low” miRs is expressed
above a threshold level. The “HeLa-high”
miRs regulate an inverter module: the miRs
prevent expression of the rtTA transcription
activator, which in turn activates expression
of the LacI transcription repressor (the miRs
also target the lacI transcript to improve the
signal-to-noise ratio). Hence, in HeLa cells,
the “HeLa-high” miRs prevent expression of
the LacI protein, permitting transcription of
hBax mRNA, which has sequences complementary
to “HeLa-low” miRs in its 3’ UTR. In
a healthy cell, high levels of one or more of
the “HeLa-low” miRs block hBax expression,
whereas in HeLa cells hBax translation can
proceed. As a result, only HeLa cancer cells
undergo apoptosis in the presence of the
classifier circuit. Adapted from Ref. [23].

The potential for practical applications of
miRNA-based logic was demonstrated by
a circuit that induces cell death specifically
in the HeLa cancer cell line [23] (Fig. 7).
The circuit accepts as inputs the levels of
two user-defined sets of miRNAs: those
expected to be highly expressed and those
expected to be absent from HeLa cells
(“HeLa-high” and “HeLa-low” miRNAs,
respectively). The output of the circuit is
the pro-apoptotic protein BAX, which triggers
cell death. The “HeLa-high” miRNAs
prevent expression of the LacI transcriptional
repressor, which in turn represses
the BAX-encoding gene. Hence, BAX protein
expression is possible only when the
“HeLa-high” miRNAs are present above
a threshold level. In parallel, “HeLa-low”
miRNAs downregulate the BAX transcript
via their recognition sites in the BAX
transcript 3’ UTR. Consequently, the circuit
leads to BAX expression if, and only if,
the host cell matches the HeLa cell profile;
that is, all “HeLa-high” miRNAs are
above a threshold level and all “HeLa-low”
miRNAs are below a threshold level [23].
The circuit triggers significantly higher 
levels of apoptosis in HeLa cells compared 
to control human embryonic kidney (HEK) 
cells, indicating its potential for future can- 
cer therapy [23]. Depending on the miRNA 
inputs used, the circuit could be designed 
to target any cell type that has a unique 
miRNA profile. 
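
The decision rule implemented by the classifier can be written as a single Boolean expression. The sketch below is an abstraction under assumptions: miRNA levels are reduced to numbers compared against one common threshold, and the miRNA names, levels, and threshold are placeholders rather than values from Ref. [23].

```python
def classifier_triggers_apoptosis(levels, high_set, low_set, threshold=1.0):
    """TRUE only if every 'high' miRNA is at or above the threshold and
    every 'low' miRNA is below it."""
    all_high_present = all(levels.get(m, 0.0) >= threshold for m in high_set)
    all_low_absent = all(levels.get(m, 0.0) < threshold for m in low_set)
    return all_high_present and all_low_absent


# Placeholder miRNA names and levels, for illustration only.
HIGH = {"miR-high-1", "miR-high-2"}
LOW = {"miR-low-1", "miR-low-2"}

target_like = {"miR-high-1": 5.0, "miR-high-2": 3.0, "miR-low-1": 0.1}
non_target = {"miR-high-1": 5.0, "miR-high-2": 3.0, "miR-low-1": 4.0}

print(classifier_triggers_apoptosis(target_like, HIGH, LOW))
print(classifier_triggers_apoptosis(non_target, HIGH, LOW))
```

Retargeting the classifier to a different cell type corresponds, in this abstraction, to changing the contents of the two sets.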

A subsequent study in mammalian 
cells used two small-molecule inputs — 
erythromycin and phloretin — to drive the 
expression of two transgenes: one that en- 
codes the RNA-binding proteins MS2 or 
L7Ae; and one that contains the fluorescent
reporter coding region fused to an RNA
box sequence that specifically binds MS2 
or L7Ae [167]. When bound to their cog- 
nate RNA boxes, MS2 and L7Ae block 
translation of the reporter transcript. This 
enables construction of NIMPLY gates, in 
which the fluorescent protein is expressed 
only if the input that activates its promoter 
is on, and the input that activates expres- 
sion of the RNA-binding protein is off. 
By combining two NIMPLY gates in one
cell, the authors accomplished the technically
challenging task of constructing a two-input
XOR gate, which outputs TRUE only if
exactly one input is TRUE [167].
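
That two NIMPLY gates suffice for XOR can be checked directly: the output is TRUE if either “A AND NOT B” or “B AND NOT A” is TRUE. The sketch below verifies this over all four input combinations; the function names are illustrative, and the OR step simply reflects the assumption that both gates produce the same reporter in one cell.

```python
def nimply(a, b):
    """A NIMPLY B: TRUE only when a is TRUE and b is FALSE."""
    return a and not b


def xor_from_nimply(a, b):
    """Two NIMPLY gates sharing one output behave as an XOR gate."""
    return nimply(a, b) or nimply(b, a)


for a in (False, True):
    for b in (False, True):
        print(a, b, xor_from_nimply(a, b))
```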

While most digital logic gates involve 
transcriptional regulation, RNA-based, 
post-transcriptional mechanisms have 
also been used [87] (see Sect. 2.5). In 
a recent study, digital logic gates were 
demonstrated in mammalian cells based 
on DNAzymes, which are synthetic DNA 
molecules that bind to and cleave a 
transcript in a sequence-specific manner 
[168]. The DNAzymes include inhibitory 
stem-loop structures, which block their 
catalytic activity unless relieved by 
binding to a specific miRNA. The use 
of DNAzymes allowed the construction 
of logic gates with specific miRNAs 
as inputs, and translation of the target 
mRNA as output. This method could 
potentially be used to detect miRNA 
profiles associated with cancer [168]. 

The above-described studies demon- 
strated the potential of using digital logic 
circuits to customize cellular signaling 
and regulatory networks to achieve vari- 
ous useful applications. However, these 
logic circuits were relatively simple, and 
most of them had single-layer gates. In 
an effort to build a complex layered logic 
circuit in E. coli, Moon et al. layered indi- 
vidual AND gates to integrate signals from 
four different inputs [169]. In their design, 
AND gates accept two promoter inputs 
and control one promoter output. Each 
gate is composed of a TF that needs a sec- 
ond chaperone protein in order to activate 
the output promoter. The authors applied 
directed evolution and part mining (see 
Sect. 2.3) to minimize crosstalk between 
the gate components, allowing the gates to 
be layered in a single cell. The result was 
a four-input AND gate, the most compli- 
cated circuit implemented in single cells 
to date [169]. 


TF-based logic circuits are limited by the 
shortage of truly orthogonal parts, as well 
as the metabolic burden they impose on 
the host cell; for example, the four-input 
AND gate requires 11 regulatory proteins 
[169]. One alternative approach would be 
to use recombinase-based circuits, which 
combine digital logic and memory [153, 
170] (see Sect. 4.2). The memory mod- 
ule can be used to design robust and 
multilayered synthetic circuits, and the 
one-time inversion of a piece of DNA 
by a recombinase places less metabolic 
load on the cell than does the contin- 
ued expression of a TF. Alternatively, 
RNA-based regulatory devices (see Sect. 
2.5) can be composed into logic gates and 
cascades that place a lower metabolic bur- 
den on the cell than does the expression 
of heterologous regulatory proteins. For 
example, a four-input NOR gate based on 
the RNA-IN-RNA-OUT system in E. coli 
[97] achieved robust digital behavior while 
requiring a fraction of the resources of the 
protein-based four-input AND circuit [169] 
(see Sect. 2.5 for details). 


6 
Analog Circuits 


As described above (Sect. 5), the major- 
ity of synthetic gene circuits built thus 
far were designed to perform in the digi- 
tal domain. However, digital computation 
requires the composition of a large num- 
ber of orthogonal devices together, each 
performing a simple binary computation, 
to achieve more complex functionalities. 
Although this is possible in the world of 
semiconductors due to the ability to as- 
semble billions of transistors on a single 
substrate and to wire them together in a 
precise fashion, it is challenging to do in 
biological systems due to the paucity of 
available parts as well as resource limitations
such as energy and space.

An alternative strategy would be to use 
analog computation, which leverages the 
inherent physics of a system to calcu- 
late mathematical functions and can thus 
achieve a greater computational complex- 
ity with fewer resources. Synthetic biolo- 
gists have designed and fabricated circuits 
that function in the analog domain, where 
a graded input is converted into graded levels
of output. The digital paradigm can be
considered as a special case of graded analog
functions where values below a given
threshold are defined as “0” and values
above that threshold are classified as “1”
(Fig. 8a). A key difference between digital
and analog circuits is the way in which 
they can be composed together in order 
to construct higher-order functions. For 
example, an analog adder can be achieved 
by simply combining two parallel circuits 
that each have different input molecules 
but generate the same output molecule 
[171]. However, a digital adder cannot
function correctly using the same principle;
for example, adding two binary “1” values
requires an additional stage to hold the
carry-out bit, because the sum is the
two-bit number “10”
(see Ref. [172] for an example of an adder
implemented with digital logic in yeast).
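
The contrast between the two adders can be made concrete. In the sketch below, the analog adder is modeled as two parallel circuits contributing to one shared output (with hypothetical gain parameters), whereas the digital half adder needs two separate outputs, a sum bit and a carry bit; the function names are chosen for illustration.

```python
def analog_adder(input_a, input_b, gain_a=1.0, gain_b=1.0):
    """Two parallel circuits produce the same output molecule, so the
    output level is simply the sum of their contributions."""
    return gain_a * input_a + gain_b * input_b


def half_adder(bit_a, bit_b):
    """Digital addition of two bits needs a sum bit and a carry bit."""
    sum_bit = bit_a ^ bit_b    # XOR
    carry_bit = bit_a & bit_b  # AND
    return carry_bit, sum_bit


print(analog_adder(0.3, 0.45))
print(half_adder(1, 1))  # (1, 0), i.e., binary 10
```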

The main advantage of analog computa- 
tion in synthetic biology is the ability to im- 
plement complex mathematical functions 
using a small number of synthetic compo- 
nents with high efficiency. For example, 
in theory a circuit can be built that com- 
putes an integral function by measuring 
the accumulation of a protein over time, 
or a differential function by measuring 
the consumption of a protein over time. 
Moreover, analog circuits are of interest 
to synthetic biology because many cellular 
processes do not rely on the all-or-none 
responses found in digital circuits; rather, 
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Fig. 8 Schematic representation of the dif- 
ference between analog and digital circuits. 
(a) Abstraction of analog and digital circuits. 
A digital circuit recognizes only the differ- 
ence between values above (1) and below (0) 
specific threshold levels. An analog circuit rec- 
ognizes differences between graded levels of 


the cell responds in a graded fashion to 
changes in input concentration [173]. For 
example, increasing the levels of arabinose 
in the media activates progressively higher 
levels of expression of arabinose trans- 
porters and metabolic enzymes in E. coli 
[174]. Hence, analog circuits are suitable 
for programming cell behaviors for possi- 
ble practical applications. 

Living cells sense their environment us- 
ing sensory systems, many of which follow 
Weber’s law, to integrate environmental 
information such as light, sound, and
chemical gradients [175]. Weber’s law represents
an analog rather than digital behavior and
defines situations in which the fold-change
between a signal level and its background
is constant, thus enabling fold-change
detection as opposed to absolute level
detection [173]. Weber’s law also applies
to molecular signaling networks and biochemical
reactions. Figure 8b shows the steady-state
level of the complex formed when two
molecules bind, such as a substrate and an
enzyme or a TF and its target promoter.

Weber’s law holds in the linear input 
dynamic range of a transcriptional circuit, 
input in the linear portion of the input-output
function; (b) Implementation of Weber’s law 
and linear range in basic biochemical reac- 
tions. Negative cooperativity expands the lin- 
ear range of the reaction. Modified from Ref. 
[171]. 


which can be either increased or decreased 
depending on the binding cooperativity of 
the reaction. Cooperativity refers to the 
effect that the first binding interaction has 
on the probability of the second binding 
event. Positive cooperativity means that 
the binding of one molecule to its target 
site increases the binding affinity for the 
second molecule. Many commonly used 
transcriptional repressors, such as LacI,
display a positive cooperativity in binding 
their target DNA [14]. Higher cooperativity 
leads to a steeper input-output transfer 
function [176], which in turn narrows the 
dynamic range of the system and produces 
an all-or-none response suitable for a
digital logic circuit. In contrast, negative 
cooperativity, whereby the binding of the 
first molecule lowers the affinity for the 
second binding, is commonplace in cell 
signaling networks [177] and it allows a 
system to respond in analog fashion over 
a wide input dynamic range. 
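
The effect of cooperativity on dynamic range can be sketched with a Hill function, a common phenomenological model of binding. Its use here is an assumption, since the text does not specify a model. For fractional occupancy y = x^n / (K^n + x^n), the input span between 10% and 90% occupancy is 81^(1/n)-fold, so a Hill coefficient n > 1 (positive cooperativity) narrows the range and n < 1 (negative cooperativity) widens it.

```python
def hill(x: float, k: float = 1.0, n: float = 1.0) -> float:
    """Fractional occupancy of a binding site at ligand concentration x."""
    return x**n / (k**n + x**n)


def input_for_occupancy(y: float, k: float = 1.0, n: float = 1.0) -> float:
    """Invert the Hill function: ligand concentration that gives occupancy y."""
    return k * (y / (1.0 - y)) ** (1.0 / n)


def dynamic_range(n: float, lo: float = 0.1, hi: float = 0.9) -> float:
    """Fold-change in input needed to move occupancy from lo to hi."""
    return input_for_occupancy(hi, n=n) / input_for_occupancy(lo, n=n)


for n in (0.5, 1.0, 2.0, 4.0):
    print(f"n = {n}: 10-90% input range = {dynamic_range(n):.1f}-fold")
```

The 10% and 90% thresholds are a conventional but arbitrary choice; other cut-offs change the numbers but not the trend.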

Building an analog circuit requires a 
wide dynamic range (the range of input 
concentrations over which the system dis- 
plays a linear input-output response). One 
challenge in constructing analog synthetic 


biology circuits lies in the switch-like 
behavior of the synthetic biology parts that 
results from the narrow dynamic range 
of many biological components. In or- 
der to build an analog circuit with these 
components, the circuit topology must be 
constructed in a way that promotes a wide 
dynamic range. 

Negative feedback loops are commonly 
used to linearize the input-output transfer 
functions of electronic and biological sys- 
tems — that is, to make the input-output 
transfer function linear over a wider range 
of input concentrations. For example, one 
study implemented an autonegative feed- 
back loop to linearize the response of a 
simple circuit to the small molecule aTc 
in S. cerevisiae [176] (Fig. 9a). The au- 
thors compared two circuits. In the first 
circuit, the TetR repressor is expressed 
from a constitutive promoter, and it re- 
presses transcription of a GFP reporter 
gene. The addition of aTc relieves tran- 
scriptional repression by TetR, allowing 
GFP expression. The circuit shows a digital 
response, with a narrow transition region 
between the GFP-off state (low aTc) and 
GFP-on state (high aTc). The second cir- 
cuit is identical to the first, except that 
TetR is expressed from a TetR-repressed 
promoter — that is, autonegative feedback 
is introduced into the circuit. In contrast 
to the first circuit, the negative feedback 
loop allows linear GFP expression. A pos- 
sible explanation is that the autonegative 
feedback reduces the basal level of the 
TetR repressor, because at low aTc con- 
centrations the TetR protein represses its 
own production; this leads to higher basal 
GFP levels and, consequently, a linear 
input-output function at low levels of aTc 
[176]. The autonegative feedback circuit 
has also been constructed in mammalian 
cells and has achieved a linear response 
[11]. 
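
The comparison of the two circuits can be explored with a toy steady-state model. This is a sketch under stated assumptions, not the model used in Ref. [176]: aTc is assumed to inactivate a fraction aTc/(aTc + k_atc) of TetR, promoters are assumed to follow a repressive Hill function, and all parameter values (`x_max`, `k`, `n`, `k_atc`) are hypothetical. In this model the feedback circuit has a higher basal GFP level and needs a much larger fold-change in aTc to go from 10% to 90% output, that is, its response is more graded.

```python
def repression(x, k=1.0, n=2.0):
    """Promoter activity under free repressor level x (1 = fully on)."""
    return 1.0 / (1.0 + (x / k) ** n)


def free_tetr_open(atc, x_max=100.0, k_atc=1.0):
    """Open loop: constitutive TetR, partly inactivated by aTc."""
    return x_max / (1.0 + atc / k_atc)


def free_tetr_closed(atc, x_max=100.0, k_atc=1.0):
    """Feedback: solve x * (1 + atc/k_atc) = x_max * repression(x) by bisection."""
    lo, hi = 0.0, x_max
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid * (1.0 + atc / k_atc) > x_max * repression(mid):
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def gfp_open(atc):
    return repression(free_tetr_open(atc))


def gfp_closed(atc):
    return repression(free_tetr_closed(atc))


def input_at(level, response, lo=1e-6, hi=1e6):
    """aTc level at which a rising response reaches `level` (log-scale bisection)."""
    for _ in range(200):
        mid = (lo * hi) ** 0.5
        if response(mid) < level:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5


def fold_range(response):
    """Fold-change in aTc between 10% and 90% of maximal GFP."""
    return input_at(0.9, response) / input_at(0.1, response)


print("open loop:", fold_range(gfp_open), "feedback:", fold_range(gfp_closed))
```

The sketch compares steady states only; it says nothing about noise or dynamics, which the original study also examined.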
Madar et al. [178] studied the naturally
occurring autonegative feedback loop
in the arabinose utilization system of 
E. coli (Fig. 9b). These authors showed 
that removing the autonegative feedback 
from the arabinose regulatory network 
decreased the input dynamic range by 
10-fold. The arabinose system is regulated 
by cAMP receptor protein (CRP), which is 
activated by cAMP, and AraC TF, which 
is activated by L-arabinose. AraC also represses
its own promoter, creating an
autonegative feedback loop (see [179-183] 
for details). To understand the role of the 
autonegative feedback loop, the authors 
put AraC under control of a constitutive 
promoter (Fig. 9b), and found that a loss 
of the autonegative feedback decreased the 
L-arabinose dynamic range by an order of
magnitude [178]. 

Recently, Daniel et al. [171] implemented 
a strong negative feedback loop in a ge- 
netic circuit, yielding a power law function 
relation between the input and the out- 
put. The input to this circuit is IPTG, 
which inhibits the binding of LacI to the 
PlacO promoter that drives the expression
of AraC from a low-copy plasmid. AraC in
turn binds to the PBAD promoter located
on a high-copy plasmid and activates expression
of the LacI repressor when the
arabinose concentration is high. A strong 
and tunable negative feedback loop was 
achieved by adjusting the ratio between the 
low-copy and high-copy plasmid (Fig. 9c). 
The LacI-IPTG transfer function exhibited 
a power law function (y = x^0.7) over two
orders of magnitude (see Ref. [171] for 
details), thus enabling power law compu- 
tations in living cells with few synthetic 
parts. 

A properly tuned positive-feedback loop 
can also linearize the dose response and 
extend the input dynamic range of a 
circuit. Daniel et al. [171] implemented 
Fig. 9 Negative feedback loops in synthetic
biology. (a) Autonegative feedback linearizes 
the dose response in a simple synthetic circuit 
[176]. The autonegative feedback loop (top), 

in which TetR represses its own expression as 
well as that of the GFP-expressing promoter, 
shows a graded, linear input-output function, 
in contrast to the steep input-output func- 
tion of the open loop (bottom); (b) Autonega- 
tive feedback loop in the arabinose utilization 


graded positive feedback loops in E. coli 
for two different types of TF, LuxR and 
AraC, and their small-molecule inputs 
(AHL and arabinose, respectively). The 
input-output transfer functions exhibited 
a wide region of linearity (over three orders 
of magnitude) when plotted on a semi-log 
plot. The circuit consists of two parts: 
the positive-feedback loop placed on a 
low-copy plasmid; and decoy binding sites 
encoded on a high-copy “shunt” plasmid,
which reduces the positive-feedback loop 
strength by shunting away a proportion of 
the TFs that are produced by the positive 


system of E. coli creates a linear response to 
increasing concentrations of arabinose. In the 
open loop system, which lacks negative au- 
toregulation by AraC, the dynamic range is 
narrower by an order of magnitude compared 
to the natural system [178]; (c) Strong nega- 
tive feedback achieves a power law function 
[171]. HCP, high-copy plasmid; LCP, low-copy 
plasmid. Panel (a) modified from Ref. [176]; 
panel (c) modified from Ref. [171]. 


feedback circuit (Fig. 10a). The shunt 
prevents the system from saturating at 
intermediate concentrations of the input, 
thus extending the dynamic range. When 
the “shunt” circuit was removed, the 
input-output dynamic range decreased by 
two to three orders of magnitude [171]. 
Like digital circuits, analog circuits can 
also be composed into cascades and 
more complex devices. The first successful 
two-stage analog cascade in living cells was 
used to linearize the transfer function of the
PlacO promoter, which is repressed by LacI
[171] (Fig. 10b). The first stage consisted 


Fig. 10 Positive-feedback loops in analog
synthetic biology circuits [171]. (a)
Wide-dynamic-range positive-logarithm cir- 
cuits for AHL and arabinose (Ara). The circuit 
achieves a wide dynamic range by using a 
positive-feedback loop and a high-copy plasmid
shunt. The positive-feedback loop prevents
saturation of the transcription factor
at intermediate concentrations of the


of the positive-feedback-and-shunt motif 
with AHL as the input and LacI as the
output, while the second stage consisted
of a LacI-repressed PlacO promoter driving
the expression of an mCherry fluorescent 
output from a high-copy number plasmid. 
The input-output transfer function of the 
cascade exhibited a wide region of linear- 
ity when plotted on a semi-log plot with a 
negative-slope function that spanned over 
four orders of magnitude (Fig. 10b). 
Finally, combining analog circuits can 
enable the implementation of complex 
mathematical functions [171]. An ana- 
log adder was built by integrating, in 
parallel, two wide-dynamic-range positive- 
logarithm circuits (positive-feedback-and- 
shunt motifs) that each take in distinct 
input molecules (AHL and arabinose) and 
inducer, and the high-copy plasmid prevents
saturation of TF-binding sites. This results
in an input-inducer-to-output-protein
transfer function that exhibits logarithmi- 
cally linear behavior with a positive slope; (b) 
Wide-dynamic-range negative logarithm cir- 
cuit for AHL. HCP, high-copy plasmid; LCP, 
low-copy plasmid. Modified from Ref. [171]. 


generate an output signal that is com- 
mon between the two circuits, mCherry 
(Fig. 11a). An analog ratio-meter was 
also constructed using the same con- 
cept as the analog adder; however, the 
AHL-responsive positive-slope circuit was
replaced with an AHL-responsive negative-slope
circuit (Fig. 11b). The output of the circuit
is proportional to the ratio of the inputs 
(AHL and arabinose). The ratio-meter op- 
erates over four orders of magnitude [171]. 
Circuits capable of calculating the ratio be- 
tween two inputs are potentially useful, as 
they enable synthetic biologists to mimic 
natural biological systems, many of which 
are balanced between two competing in- 
puts, and to normalize inputs with respect 
to each other for biosensing and control 
applications. 
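
The arithmetic behind the adder and the ratio-meter can be sketched by idealizing each wide-dynamic-range circuit as a function that is exactly linear in the logarithm of its input. This is an assumption: the slopes and offsets below are hypothetical, and real circuits are log-linear only over a limited range. Under this idealization, summing two positive-logarithm outputs tracks the logarithm of the product of the inputs, while summing a positive-logarithm and a negative-logarithm output tracks the logarithm of their ratio.

```python
import math


def positive_log(x: float, slope: float = 1.0, offset: float = 0.0) -> float:
    """Idealized circuit whose output rises linearly with log10(input)."""
    return offset + slope * math.log10(x)


def negative_log(x: float, slope: float = 1.0, offset: float = 0.0) -> float:
    """Idealized circuit whose output falls linearly with log10(input)."""
    return offset - slope * math.log10(x)


def analog_adder(ahl: float, ara: float) -> float:
    """Shared output of two positive-log circuits: tracks log10(ahl * ara)."""
    return positive_log(ahl) + positive_log(ara)


def ratio_meter(ahl: float, ara: float) -> float:
    """Positive-log arabinose circuit plus negative-log AHL circuit."""
    return positive_log(ara) + negative_log(ahl)


# Scaling both inputs by the same factor leaves the ratio-meter output unchanged.
print(ratio_meter(ahl=1.0, ara=10.0), ratio_meter(ahl=10.0, ara=100.0))
```

The invariance to a common scaling of both inputs is what makes such a circuit useful for normalizing one signal against another.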
Fig. 11 Analog circuits perform mathematical
functions in living cells. (a) An analog
adder and (b) an analog ratio-meter
were implemented through combinations of 


The design and construction of syn- 
thetic analog circuits in living cells is 
a novel approach that poses new chal- 
lenges. The accumulation of noise and 
the need for a high signal-to-noise ratio 
will be among the main challenges in 
scaling-up analog genetic circuits. How- 
ever, these challenges have been faced 
and addressed by scientists in other fields 
and their solutions can surely be adapted 
to synthetic biology. For example, hybrid 
analog–digital designs can optimize energy
efficiency and information precision
in one system (see Ref. [171]).


7 
Intercellular Communication and Synthetic 
Multicellular Devices 


While the above-described circuits func- 
tion at the level of a single cell, there is 
growing interest in programming multi- 
cellular systems in which multiple cells 


positive-logarithm and negative-logarithm cir- 
cuits that accept distinct inputs (AHL and 
arabinose) and generate a shared output, 
mCherry. Modified from Ref. [171]. 


communicate with each other to accom- 
plish a specific task [184]. One reason for 
this is the limitation on the size and com- 
plexity of a circuit that can be built in a 
single cell, due to the crosstalk among cir- 
cuit components and the metabolic load 
imposed on the cell (see Sect. 8). Yet, the 
designers of multicellular systems may 
overcome this limitation by dividing a 
large, complex circuit into smaller sub- 
systems that are implemented in separate 
cell strains programmed to communicate 
with each other. In addition, populations 
of cells may carry out behaviors that can- 
not be implemented in a single cell, such 
as spatial patterning, thus opening up the 
possibility of practical applications such 
as tissue patterning for transplants. Cur- 
rently, there is also much interest in de- 
veloping consortia of multiple microbial 
strains or species that can cooperate in the 
synthesis of biofuels and other valuable 
chemicals [185]. 


7.1
Intercellular Communication Mechanisms 


In a multicellular system, cells must be 
able to send and receive signals among 
each other. Multicellular synthetic cir- 
cuits have harnessed the molecules that 
cells use naturally to communicate. One 
well-known example is quorum sensing 
(QS), a process by which bacteria commu- 
nicate with each other via the diffusion of 
small molecules that they synthesize. A 
QS module includes an enzyme that syn- 
thesizes the diffusible signaling molecule 
and a TF that regulates its target genes 
when bound to the signaling molecule. 
In E. coli, commonly used orthogonal QS 
systems are LuxR/LuxI from Vibrio fischeri
and LasR/LasI or RhlR/RhlI from
Pseudomonas aeruginosa [15, 16]. The LuxI
enzyme synthesizes the diffusible signaling
molecule, an AHL, which, when bound
to the LuxR TF, activates gene expression
from the Plux promoter. Other forms of
intercellular communication exist among 
eukaryotes: S. cerevisiae cells communicate 
using mating pheromones [186], while 
mammalian cells possess a suite of sig- 
naling proteins that trigger complex sig- 
naling cascades and elaborate behaviors 
in recipient cells [187-192]. Alternatively, 
a cell may be engineered to synthesize 
a molecule that does not normally func- 
tion in signaling, and that molecule may 
then diffuse out of the sender cell and 
into a recipient cell engineered to per- 
form a specific action in response to the 
input. 

One disadvantage of small-molecule- 
based intercellular communication is 
that only one type of information can 
be transmitted through a given channel 
(namely, high versus low concentrations 
of the signaling molecule) [193]. More 
advanced multicellular devices could 
be programmed if the communication
channel could transmit diverse
user-defined messages. An obvious candidate
for such a communication channel is
intercellularly transmitted DNA. In a recent
study [193], communication was engineered
between E. coli cells via M13 bacteriophages,
which exit the host cell without
killing it and can transmit multi-kilobase 
single-stranded DNA molecules between 
cells. In this proof-of-concept study, the 
authors programmed sender cells with 
M13 phages encoding T7 RNAP. Upon 
receiving the phage message, the receiver 
cells activate GFP reporter expression 
from a T7 RNAP-responsive promoter 
[193]. Another possibility for DNA-based 
intercellular communication would be to 
use bacterial conjugation, the process by 
which bacteria exchange plasmids. While 
this has not yet been implemented, a 
recent computational simulation suggests 
that digital logic gates may be constructed 
by coculturing bacterial strains that 
can exchange plasmid-based messages 
[194]. 


7.2 
Examples of Synthetic Multicellular 
Systems 


Some multicellular systems are isogenic: 
all cells in the system are programmed 
with the same genetic circuit, and they 
communicate with each other to perform 
a specific task [184]. In an isogenic system, 
intercellular communication provides a 
way for cells to synchronize their behav- 
ior and to form spatial patterns. In one 
example, bacterial QS was used to couple 
and synchronize oscillators (see Sect. 3.1) 
across a bacterial population [195] (Fig. 12). 
It should be noted that, at low cell
densities, oscillations were not observed
because the concentration of the QS molecule
Fig. 12 Using quorum sensing (QS) to couple
and synchronize oscillators in an E. coli
population. (a) Circuit diagram of the syn- 
chronized oscillator [195]. AHL is a diffusible 
QS molecule produced by the LuxI enzyme.
The LuxR TF is expressed from a constitutive
promoter (Pconst) and complexes with AHL to
make LuxR:AHL, which activates the Plux
promoter. AiiA catalyzes the degradation of AHL.


AHL was too low to sufficiently activate 
gene expression. However, concentrating 
the bacterial cells in a microfluidic cham- 
ber allowed AHL to reach sufficient levels 
to trigger oscillations [195]. In a related 
study [196], a collection of contained cellu- 
lar populations arranged in a grid was syn- 
chronized using hydrogen peroxide that 
had been synthesized by the cells and dif- 
fused among the cell populations. 
Multicellular systems allow more so- 
phisticated computation than is possible 
in a single cell, such as spatial pattern- 
ing in a cell population. An outstanding 
example is the bacterial edge detection 
circuit [197]. By coupling a light sensor 
(Cph8), logic (NOT and AND) gates, and 
QS modules, the authors programmed a 
population of E. coli cells to detect and 



As AHL accumulates, more AiiA is produced 
from the Plux promoter, leading to degradation
of AHL; subsequent loss of AiiA expression 
allows AHL levels to rise again after a delay, 
leading to oscillations; (b) AHL is used to 
couple together the dynamics of oscillators 

in different cells across the population. Osc, 
oscillators. Panel (a) modified from Ref. [195]. 


outline the (dark-light) edges of a projected 
image on a lawn of bacteria (Fig. 13). In 
this example, a cell produces an output 
(pigment) only if the cell itself is exposed 
to light whilst its neighbors, which com- 
municate with it via QS, are in the dark 
[197]. In another study, E. coli cells were 
programmed to form a pattern of alternat- 
ing stripes of high cell density and low cell 
density on a culture plate by coupling QS 
to cell motility [198]. 
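
The edge-detection logic described above (output only in a lit cell that receives a signal from a dark neighbor) can be sketched on a one-dimensional row of cells. The `reach` parameter, which stands in for how far the QS signal diffuses, and the row layout are assumptions made for illustration; the actual system operates on a two-dimensional lawn with continuous diffusion.

```python
def detect_edges(light: list[bool], reach: int = 1) -> list[bool]:
    """Return True where a lit cell lies within `reach` positions of a dark cell."""
    n = len(light)
    # Dark cells act as senders: only they make the diffusible signal.
    dark = [not lit for lit in light]
    pigment = []
    for i, lit in enumerate(light):
        neighbors = range(max(0, i - reach), min(n, i + reach + 1))
        receives_signal = any(dark[j] for j in neighbors if j != i)
        # Output = (cell is in the light) AND (signal arrives from a dark cell).
        pigment.append(lit and receives_signal)
    return pigment


# A dark-light-dark pattern: pigment appears only at the two boundaries.
lawn = [False, False, False, True, True, True, True, False, False]
edges = detect_edges(lawn)
print(edges)
```

Cells in the interior of the lit region stay unpigmented because no dark cell is within reach, and dark cells stay unpigmented because the first condition fails.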

Unlike isogenic cell systems, some mul- 
ticellular systems consist of two different 
cell strains, each engineered with a dif- 
ferent circuit. Such multistrain consortia 
can be used for constructing digital logic 
gates (see Sect. 5). The construction of 
multilayered logic gates in single cells is 
hampered by crosstalk among different 


Fig. 13. The bacterial dark/light edge de- 
tection circuit in E. coli [197]. (a) Cph8 is a 
light-sensing protein, which can activate the 
PompC promoter only in the absence of red
light. As a result, only cells in the dark can
express LuxI and synthesize AHL molecules,
which then diffuse to neighboring cells and
allow LuxR to activate expression of the lacZ
reporter. Cells in the dark also express the λ
CI transcriptional repressor from a PompC promoter.
Only cells that do not express CI and


parts of the circuit, as well as by the large 
metabolic load that a large genetic circuit 
places on the host cell (Sect. 8). Both prob- 
lems may be circumvented by distributed 
multicellular computation, whereby each 
subpopulation processes a specific logic 
gate and produces output in the form of 
a diffusible small molecule, which in turn 
can diffuse through the cell population 
and act as an input for a logic gate in 
another cell. Using this distributed com- 
puting strategy, Tamsir et al. constructed 
all 16 possible two-input logic gates in 
E. coli [199]. In another example, popu- 
lations of yeast cells, each encoding an 
individual logic gate, were connected via 
diffusible pheromone “wires” to make 
higher-order digital logic circuits, includ- 
ing a multiplexer and 1-bit adder [172]. 
In both of these complex circuits layering 
of the gates was made possible through 
controlled interactions of subpopulations 
of cells. 
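
One way to see how a larger gate can be distributed across subpopulations is to treat each strain as a single simple gate whose Boolean output is passed on as a diffusible signal. The sketch below uses NOR gates, which are functionally complete; the specific wiring is a textbook construction chosen for illustration and is not claimed to match the layouts used in Refs [199] or [172].

```python
def nor_colony(*signals: bool) -> bool:
    """One strain: secretes its output signal only if no input signal is present."""
    return not any(signals)


def xor_consortium(a: bool, b: bool) -> bool:
    """XOR assembled from five NOR colonies connected by diffusible signals."""
    c1 = nor_colony(a, b)
    c2 = nor_colony(a, c1)    # true only when a is absent and b is present
    c3 = nor_colony(b, c1)    # true only when b is absent and a is present
    c4 = nor_colony(c2, c3)   # XNOR of the two inputs
    return nor_colony(c4)     # a single-input NOR acts as an inverter


for a in (False, True):
    for b in (False, True):
        print(a, b, xor_consortium(a, b))
```

Because each colony computes only one gate, no single cell has to carry the whole circuit, which is the point of the distributed strategy.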

that receive AHL from nearby cells can express
lacZ from the Plux-λ hybrid promoter.
The expression of lacZ, which encodes a blue 
pigment-producing enzyme, occurs only at the 
edge of light and dark, in cells that receive 
light (no CI expressed) and are near cells in
the dark (receive AHL); (b) Truth table for the 
circuit shown in panel (a); (c) A schematic 
representation of the computed dark/light 
edge (gray) on a lawn of bacteria. Modified 
from Ref. [197]. 


The first synthetic intercellular com- 
munication device in mammalian cells 
used sender cells engineered to synthe- 
size the volatile molecule acetaldehyde, 
which diffused into neighboring receiver 
cells and triggered reporter gene tran- 
scription from an acetaldehyde-responsive 
promoter [200]. Replacing the mammalian 
sender cells with acetaldehyde-producing 
E. coli or S. cerevisiae cells allowed in- 
terspecies signaling [200]. A subsequent 
study demonstrated two-way communica- 
tion in mammalian cells, whereby one 
cell strain produced L-tryptophan and expressed
a reporter gene in response to
acetaldehyde, while the second cell strain 
synthesized acetaldehyde and responded 
to L-tryptophan [201]. Using this bidirec- 
tional system, the authors programmed 
the cell consortium for sequential pro- 
duction of angiopoietin-1 and vascular 
endothelial growth factor, two proteins 
required for the formation of mature 
blood vessels [201]. This study presents
an example of how synthetic intercel- 
lular signaling may be used for tissue 
engineering, with potential therapeutic 
benefits. 


8 
Synthetic Circuit Construction: Challenges 
and Solutions 


Synthetic biology aims to make the de- 
sign of biological devices as predictable 
as in other disciplines, such as mechani- 
cal or electrical engineering. While great 
progress has been achieved, this goal remains
elusive, owing to the immense complexity
of living organisms and our incomplete
understanding of them. In order for a synthetic
circuit to function correctly, the following 
conditions must be met [202]: 


• Each component of the circuit (TF,
promoter, regulatory RNA element, and
others) should function as expected.

• The circuit design and the parameters
of the circuit components (e.g., the rates
of TF synthesis and degradation) are
suitable to the specified task.

• The circuit avoids unwanted interactions
with the host or the host’s environment
that disrupt circuit function or
impair the host’s viability.

• The circuit is robust, that is, it is capable
of maintaining function in the presence
of intrinsic and extrinsic noise.


Synthetic building blocks and their functions
are described in Sect. 2. Various aspects
of circuit design, circuit–host interactions,
and robustness to noise are discussed
below. Further discussions of synthetic
circuit troubleshooting are available in re- 
cent reviews [202, 203]. 


8.1 
Circuit Design: Topology and Parameters 


Previous studies in synthetic biology have 
shown that circuit topology (the way in 
which the circuit components are wired 
together) and the parameters of the circuit 
components qualitatively affect circuit be- 
havior. For example, the damping behavior 
of an oscillator depends on its topology: a 
repressilator that consists of two genes re- 
pressing each other shows damping after 
a limited number of cycles, but an oscilla- 
tor with an amplifier loop shows sustained 
oscillations [121] (Sect. 3.1). In the case of 
a toggle switch, changing the relative ex- 
pression levels of the two transcriptional 
repressors can change the behavior of the 
circuit from a switch into a timer [137]. 
Recently, a gene circuit in E. coli was 
converted from analog to digital function 
through adjusting plasmid copy numbers 
with a small-molecule inducer [171]. 
Computational modeling of different 
circuit topologies and sets of parameters 
has been used to identify network mo- 
tifs that are necessary for, or enriched in, 
circuits that carry out desired functions. 
For example, the computational modeling
of 1.6 × 10^4 three-node enzyme-based
networks has been carried out to identify 
networks capable of adaptation (defined 
as an initial response to a stimulus, fol- 
lowed by return to the initial state even 
in presence of continued stimulus, a com- 
mon property of sensory systems) [204]. 
The authors found that all 395 networks 
that display robust adaptation contained
at least one of the following motifs:
a negative feedback loop or an incoherent 
feedforward loop (see Sect. 8.3) [204]. In 
another study, Lim and colleagues used 
computational modeling to identify candi- 
date network motifs that can induce cel- 
lular polarization, a vital property of living 


cells that is a prerequisite to behaviors 
such as directional migration [205]. These 
authors used the predictions to construct 
synthetic networks that trigger asymmet- 
ric accumulation of the membrane phos- 
pholipid PIP3 in the cell membrane of 
S. cerevisiae cells. Robust polarization was 
accomplished using networks that com- 
bined positive feedback with mutual in- 
hibition between the synthetic proteins 
[205]. 

Another important consideration in circuit
design is level matching: in any
circuit where the output of one part of the 
circuit serves as the input for another part, 
the concentration of the output must be 
in the correct range to trigger the desired 
response from the downstream part of the 
circuit. Level matching may be achieved in 
different ways, such as adjusting the copy 
number of a gene, or mutating its RBS to 
affect translation initiation rates and hence 
the level of protein expression [30]. 

Moreover, the time required for each 
layer of the circuit to process its input 
and produce an output must be consid- 
ered [206]. The time delay due to biolog- 
ical processes such as transcription and 
translation affects the timing of the circuit’s
response, and may limit the complexity of
circuits that can be implemented in living 
cells. For example, RNA-based cascades, 
which do not require translation, are faster 
than TF-based regulatory cascades [87]. 
Protein-based signal transduction path- 
ways are faster still, but the ability to design 
them is still rudimentary despite advances 
made over the past few years [118]. 

The response time of a transcription 
regulator may be shortened by using a 
negative autoregulatory feedback loop, in 
which the TF represses its own promoter. 
This strategy relies on using a strong 
promoter, which drives TF expression at 
a high level when initially induced. When 
the TF has accumulated above a threshold,
it binds and represses its own promoter.
This combination of a rapid initial build-up
in TF concentration followed by negative 
autoregulation allows the TF to reach its 
steady-state level very quickly [207]. 
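
The speed-up can be illustrated with a minimal deterministic sketch that compares two designs reaching the same steady-state TF level: a weak constitutive promoter, and a strong promoter under negative autoregulation. The rate law, the forward-Euler integration, and all parameter values are assumptions chosen for illustration and are not taken from Ref. [207].

```python
ALPHA = 1.0  # first-order loss rate (degradation plus dilution), arbitrary units
X_SS = 1.0   # steady-state TF level shared by both designs


def time_to_half(production, x_ss=X_SS, alpha=ALPHA, dt=1e-4, t_max=20.0):
    """Time for x (starting at 0) to reach x_ss / 2 under dx/dt = production(x) - alpha * x."""
    x, t = 0.0, 0.0
    while x < 0.5 * x_ss and t < t_max:
        x += (production(x) - alpha * x) * dt
        t += dt
    return t


def simple(x):
    """Weak constitutive promoter: constant production equal to ALPHA * X_SS."""
    return ALPHA * X_SS


# Strong promoter; K is chosen so that BETA / (1 + (X_SS / K)**N) = ALPHA * X_SS.
BETA, K, N = 10.0, 1.0 / 3.0, 2.0


def autoregulated(x):
    """Strong promoter repressed by its own product."""
    return BETA / (1.0 + (x / K) ** N)


t_simple = time_to_half(simple)
t_nar = time_to_half(autoregulated)
print(f"simple: {t_simple:.3f}, autoregulated: {t_nar:.3f}")
```

In this sketch the autoregulated design reaches half of its steady state several-fold sooner, because production starts at the full strength of the strong promoter and is throttled only as the TF accumulates.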

In order to design a circuit with the de- 
sired behavior, synthetic biologists take ad- 
vantage of mathematical modeling. Com- 
monly used mathematical modeling tech- 
niques include sensitivity analysis, which 
quantifies the effect of a parameter (or pa- 
rameters) on overall circuit performance, 
and bifurcation analysis, which identifies 
the boundary in parameter space that sep- 
arates circuits with qualitatively different 
behaviors, such as a stable steady state ver- 
sus an oscillator [208]. Multiple software 
programs are available for modeling cir- 
cuit behavior in silico (for reviews, see Refs 
[162] and [208]). As circuits become more 
complex, reliance on mathematical
models for their design will increase.
For example, Purcell et al. recently de- 
scribed a platform for simulating synthetic 
circuits in the context of whole-cell models 
[209]. 
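
As a hedged example of the sensitivity analysis mentioned above, the sketch below estimates the logarithmic sensitivity (the percentage change in output per percentage change in a parameter) of a simple steady-state model by central finite differences. The model, the parameter names, and the values are hypothetical and serve only to show the procedure.

```python
import math


def steady_state_output(p: dict) -> float:
    """Steady-state level of a protein expressed from a repressible promoter."""
    return (p["beta"] / p["alpha"]) / (1.0 + (p["repressor"] / p["K"]) ** p["n"])


def log_sensitivity(model, params: dict, name: str, rel_step: float = 1e-4) -> float:
    """Central-difference estimate of d ln(output) / d ln(parameter)."""
    up = dict(params)
    down = dict(params)
    up[name] = params[name] * (1.0 + rel_step)
    down[name] = params[name] * (1.0 - rel_step)
    d_log_output = math.log(model(up)) - math.log(model(down))
    d_log_param = math.log(up[name]) - math.log(down[name])
    return d_log_output / d_log_param


PARAMS = {"beta": 10.0, "alpha": 1.0, "repressor": 1.0, "K": 1.0, "n": 2.0}

for name in PARAMS:
    s = log_sensitivity(steady_state_output, PARAMS, name)
    print(f"{name}: {s:+.3f}")
```

Parameters with sensitivities near zero can be left loosely specified, whereas those with large magnitudes are the ones worth tuning or measuring carefully.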


8.2 
Circuit–Host and Circuit–Environment
Interactions 


Unwanted interactions between the cir- 
cuit and its host cell may lead to cir- 
cuit failure or cell death. For example, 
a component of the circuit may be toxic 
to the host cell and indeed, a recent 
study identified over 15 000 heterologous 
genes as toxic to E. coli [210]. Con- 
versely, endogenous host genes or proteins 
may interfere with circuit function; for 
example, a native DNA sequence may 
bind a sTF and titrate it away from 
its target promoter. An increased knowl- 
edge of the host cell’s regulatory and 
metabolic networks through the analysis
of large-scale datasets will help syn- 
thetic biologists in designing circuits that 
will function as expected in a given 
host. 

Even when none of the circuit compo- 
nents is toxic to the host, the designers 
must consider the metabolic load that the 
circuit imposes on the host cell in terms of 
requirement for ATP, RNAPs, ribosomes, 
and nucleotides [211, 212]. A metaboli- 
cally expensive circuit may interfere with 
the cell’s normal function and place selec- 
tive pressure on the host to inactivate the 
synthetic device. 

Certain circuit designs may help to al- 
leviate metabolic load. Different ways of 
implementing the same type of function 
may have different energy requirements; 
for example, a memory device based on 
one-time recombinase-mediated DNA in- 
version is less metabolically demanding 
than a memory device that requires con- 
tinued protein synthesis [153]. Moreover, 
regulatory RNA devices, which do not 
rely on the host’s translation machinery, 
have a smaller metabolic footprint than 
TF-mediated regulation [87]. Another way 
to reduce the metabolic burden on any 
single cell is to distribute a large circuit 
among several strains of cells by placing a 
smaller subcircuit in each cell strain [172, 
199] (Sect. 7). While powerful, distributed 
computing adds a level of complexity to 
the circuit design; the investigator must 
consider not only the effect of the cir- 
cuit on each host strain but also the way 
in which two or more host strains inter- 
act to accomplish their task. For example, 
if cells of different strains communicate 
via a diffusible molecule, they must be in 
close physical proximity in order for the 
receiver cell to detect the signal [199]. If 
distributed computing is carried out by 
multiple engineered strains in a common 


culture medium [172], the computation 
may fail if one of the strains outcompetes 
the others. 


8.3 
Noise and Robustness 


Living cells are noisy, due to differences 
among individual cells (or in a single cell 
over time) in parameters such as the num- 
ber of TFs, mRNAs, and ribosomes; cell 
volume; state of cell cycle; and chromatin 
modifications [213]. Every component of a 
genetic circuit — whether synthetic or natu- 
ral — experiences and propagates this noise 
to some extent, which in turn can be ampli- 
fied by the noise from other components 
of the circuit. This will intensify the overall 
noise and hence may disrupt the perfor- 
mance of a genetic circuit. Moreover, the 
cell must cope with extrinsic noise, such as 
fluctuations in nutrient availability, tem- 
perature, pH, and other environmental 
variables. Notably, many practical appli- 
cations (e.g., medicine or bioremediation) 
will require the engineered cell to function 
in a more unpredictable environment than 
a culture flask in the laboratory. Hence, a 
synthetic device must be able to minimize 
noise where possible and to maintain a 
desired function in the presence of un- 
avoidable noise; this property is called 
robustness [214]. 

Over long time scales, evolution has 
selected for robustness in natural gene 
circuits [215]. However, in synthetic cir- 
cuits, which lack an evolutionary tuning 
process, noise can blur the desired output 
and therefore is not usually considered 
a favorable factor. Without a mechanism 
for controlling noise, increasing the com- 
plexity of a circuit will likely increase the 
uncertainty of the output. 

The noisy nature of biological systems 
is part of the reason why many synthetic 


circuits employ digital logic (see Sect. 5). 
Digital designs reduce the effect of noise 
by reducing the number of outputs of a 
given component of a circuit to two states: 
TRUE and FALSE. Ideally, the states are 
defined in such a way that passing from 
one state to another requires a significant 
change in concentration of input(s) of the 
component (e.g., an inducer or a TF), 
and therefore transition between states 
by cellular noise alone is very unlikely. 
Given that the number of distinguishable 
states that a circuit can possess can be 
considered a measure of complexity of the 
circuit, reduced complexity is a price that 
digital designs pay to overcome cellular 
noise. Digital gene circuits are generally 
more robust with respect to cellular noise 
than their corresponding analog circuits; 
however, they require more components 
than an analog circuit to achieve the same 
level of complexity, and hence they place a 
higher metabolic load on the cell [216]. 
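
The noise-rejecting effect of a two-state design described above can be
illustrated with a minimal sketch. It assumes, purely for illustration,
that a component responds to its input through a steep Hill function;
the function names, the parameters K and n, and the input values are
hypothetical and are not taken from the cited work.

```python
def hill(x, K=10.0, n=4.0):
    """Steep (switch-like) activating response, normalized to the range 0..1."""
    return x**n / (K**n + x**n)


def digital_state(x, K=10.0, n=4.0, cutoff=0.5):
    """Collapse the response to two states: TRUE (on) or FALSE (off)."""
    return hill(x, K, n) >= cutoff


low_input, high_input = 2.0, 50.0   # nominal FALSE and TRUE inputs (arbitrary units)
noise = 0.3                         # assumed +/-30% fluctuation of the input

for nominal in (low_input, high_input):
    noisy_inputs = [nominal * (1 - noise), nominal, nominal * (1 + noise)]
    # The state is unchanged across the noisy inputs because both nominal
    # levels sit far from the switching point K.
    print(nominal, [digital_state(x) for x in noisy_inputs])
```

In this sketch the two nominal input levels are placed far from the
switching point, so a fluctuation of the assumed size cannot flip the
state; the cost is that the component reports only two values.
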
Computational and experimental anal- 
ysis of naturally occurring gene network 
motifs has revealed the robustness of cer- 
tain motifs. For example, negative autoreg- 
ulation, whereby a TF represses transcrip- 
tion of its own gene, helps to minimize 
noise in the TF expression levels: cells that 
initially have more TF will produce less of 
it, and cells that initially have less TF will 
produce more [217]. Another example of 
a motif that occurs commonly in natural 
gene regulatory networks is the feedfor- 
ward loop (FFL), which consists of three 
nodes: node X regulates node Y, and X 
and Y regulate node Z. In coherent FFLs, 
the regulatory interaction (either activat- 
ing or repressing) between X and Z is 
the same for both branches of the reg- 
ulatory pathway; that is, if X activates Z 
directly, then X also activates Z via Y. In 
an incoherent FFL, the direction of the 
regulatory interaction is different for the 




two branches of the pathway so that, for 
example, X activates Z directly and inhibits 
Z via Y [218]. Mathematical modeling in- 
dicates that the coherent type 1 FFL, which 
consists entirely of positive interactions — 
X activates Y, and X and Y both activate 
Z — is the most robust to noise among 
the coherent FFLs [219]. The robustness 
of the coherent type 1 FFL may account 
for the large number of times this motif 
occurs in the gene regulatory networks of 
E. coli and S. cerevisiae [219]. Synthetic cir- 
cuit design may benefit from using robust 
genetic motifs such as the coherent type 1 
FFL that occur repeatedly in many natural 
systems, providing them with properties 
such as fast response time, robustness, 
or memory (see Ref. [218] for a detailed 
discussion of network motifs). 
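
The coherent/incoherent distinction described above reduces to comparing
the sign of the direct branch (X to Z) with the sign of the indirect
branch (X to Y to Z). The following sketch encodes that rule; the
function and constant names are illustrative only.

```python
ACTIVATE, REPRESS = +1, -1


def classify_ffl(x_to_y, y_to_z, x_to_z):
    """Classify a three-node feedforward loop by the signs of its edges.

    The indirect branch X -> Y -> Z has the sign x_to_y * y_to_z.
    The loop is coherent when that sign matches the direct edge X -> Z.
    """
    indirect = x_to_y * y_to_z
    return "coherent" if indirect == x_to_z else "incoherent"


def is_coherent_type1(x_to_y, y_to_z, x_to_z):
    """Coherent type 1: all three interactions are activating."""
    return (x_to_y, y_to_z, x_to_z) == (ACTIVATE, ACTIVATE, ACTIVATE)


# X activates Y, and X and Y both activate Z.
print(classify_ffl(ACTIVATE, ACTIVATE, ACTIVATE))   # coherent
# X activates Z directly but inhibits Z via Y.
print(classify_ffl(ACTIVATE, REPRESS, ACTIVATE))    # incoherent
```
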


8.4 
Evolution 


In addition to making a system robust 
to short-term fluctuations in the levels 
of metabolites, circuit components, and 
so forth, synthetic biologists must also 
consider the impact of evolution on the 
function of the circuit over time. While 
evolution has been harnessed to improve 
circuit function and develop libraries of 
diverse components [220], it is also prob- 
lematic in synthetic biology: circuits un- 
dergo point mutations, rearrangements, 
and deletions that disrupt their function, 
and may be lost from the host cell al- 
together after a number of generations. 
Repetitive sequences, high metabolic load, 
high plasmid copy number, and a quickly 
replicating host cell such as E. coli increase 
the probability of the circuit being lost or 
mutated [214]. 

The issues of circuit-host interactions,
robustness and evolution are interrelated
(e.g., a high metabolic load is likely to lead
to evolutionary instability of the circuit),
and may present trade-offs. For example, a 
reduction in protein level noise appears 
to come at the cost of higher energy 
requirements [221]. Synthetic biology is 
an iterative process: successes and failures 
in circuit construction teach about the 
mechanisms of gene network function, 
and these lessons can then be applied to 
the next round of circuit modeling and 
design. A deeper understanding of living 
systems, larger and better-characterized 
component libraries, and better models 
will help synthetic biology to advance the 
construction of predictable, robust circuits 
for practical applications. 


9 
Applications of Synthetic Circuits 


Synthetic gene circuits used in applica- 
tions must function reliably in adverse 
and often undetermined conditions. Con- 
sequently, applications push the field for- 
ward by challenging synthetic biologists 
to design robust, efficient circuits [7]. 
Examples of three types of practical ap- 
plications — biosensors, therapeutics, and 
biomanufacturing — are presented below. 


9.1 
Biosensors 


Biosensors translate the concentration of 
a specific analyte in the environment into 
a measurable signal by combining a living 
cell with a hardware platform that enables 
detection [222]. The output is usually a 
measurable signal, such as GFP or the blue 
pigment produced by the β-galactosidase
enzyme. Biosensors may be judged by 
their selectivity for binding the target ana- 
lyte, their sensitivity for low concentrations 
of the analyte, the input dynamic range 


of analyte concentrations, and the output 
signal-to-noise ratio. Many biosensors use 
transcription-factor-based circuits to mea- 
sure the concentration of a toxic com- 
pound such as arsenite [196, 223, 224]. 
The goal is to use these biosensors in the 
developing world for testing water before 
its consumption. These field conditions 
make standardized measurements chal- 
lenging, because the physiological state of 
the living cells in the biosensor is highly 
variable. To overcome this, Wackwitz et al. 
[224] developed a series of complementary 
cell strains that are tuned to respond to dif- 
ferent arsenite concentrations by changing 
the strength of the RBS in front of the re- 
porter gene. When calibrated and used 
in concert, these strains greatly improved 
arsenite detection. Rather than having 
the steady-state expression level of a re- 
porter protein as their output, Hasty and 
colleagues [196] developed an oscillator 
whose frequency varies as a function of 
arsenite concentration; this decoupled the 
biosensor from the imaging conditions, 
such as beam power and exposure time. 
Further, the biosensor was an array of bac- 
terial colonies in a microfluidic device that 
synchronized their oscillations at both the
micro- and macro-scale through diffusible
and gas-phase molecules, respectively,
enabling an accurate, high-strength signal 
that allows the biosensor to function as a 
handheld device. 
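
Why a frequency-encoded output is insensitive to imaging conditions can
be shown with a small sketch. It assumes an idealized sinusoidal reporter
signal, and models imaging conditions as a hypothetical gain and offset
applied by the detector; none of these names or values come from the
cited study.

```python
import math


def reporter_trace(freq, gain, offset, duration=100.0, dt=0.01, phase=0.3):
    """Sampled oscillating reporter signal as seen by a detector.

    `gain` and `offset` stand in for imaging conditions (hypothetical).
    """
    n = int(round(duration / dt))
    return [offset + gain * (1.0 + math.sin(2 * math.pi * freq * i * dt + phase))
            for i in range(n)]


def estimate_frequency(trace, duration=100.0):
    """Count upward crossings of the mean level and divide by the duration."""
    mean = sum(trace) / len(trace)
    crossings = sum(1 for a, b in zip(trace, trace[1:]) if a < mean <= b)
    return crossings / duration


dim = reporter_trace(freq=0.05, gain=1.0, offset=0.2)
bright = reporter_trace(freq=0.05, gain=7.5, offset=3.0)

# The mean intensity depends strongly on the imaging conditions ...
print(sum(dim) / len(dim), sum(bright) / len(bright))
# ... whereas the estimated frequency does not.
print(estimate_frequency(dim), estimate_frequency(bright))
```

An intensity-based readout would need calibration for every change in
gain or offset, whereas the crossing count depends only on the timing of
the oscillation.
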

In addition to cellular biosensors, syn- 
thetic biology has enabled the develop- 
ment of real-world deployable microbial 
sensors based on engineered phages. For 
example, Sample6 Technologies is com- 
mercializing a near-real-time microbial 
pathogen detection system based on engineered
phages [225]. Bacteriophages (“bacteria
eaters”) are viruses that infect specific
species and strains of bacteria [226]. By 
engineering phage libraries to express 


reporter genes during the infection of 
target bacteria, the presence or absence 
of pathogens can be detected. Reporter 
phages have been designed for the de- 
tection of clinically relevant pathogens 
such as Mycobacterium tuberculosis, Staphy- 
lococcus aureus, and food-borne pathogenic 
E. coli [227]. The advantages of engineered
phage diagnostics include high sensitivity,
high specificity, and short time-to-detection,
which address shortcomings of conventional
microbial detection approaches such as
polymerase chain reaction (PCR),
immunoassays, and culture. Moreover,
phages specifically detect live bacteria that 
support phage replication, whereas some 
other rapid detection methods, such as 
PCR, do not distinguish between live and 
dead bacteria [227]. The tools of synthetic 
biology enable a rapid design-build-test
cycle for new diagnostic systems, such as 
engineered phages, thus allowing assays to 
be built, tested and improved for real-world 
applications. 


9.2 
Therapeutic Applications 


Synthetic biology holds tremendous 
promise in the area of drug development, 
diagnosis and treatment of disease [13]. 
Cell-based therapeutics constitute an 
emerging therapeutic class, and central to 
their efficacy are synthetic gene circuits 
that process environmental signals and 
actuate treatment, enabling appropriate 
selectivity, distribution, and dosage [13]. 
Microbial cells have been engineered 
to combat infectious diseases by either 
killing [228] or downregulating the 
growth [229] of pathogenic organisms. 
Both therapeutics rely on QS circuits 
(see Sect. 7) that trigger the release of 
active agents only in the presence of the 
target pathogen’s quorum signal. Such
targeted killing is a promising application 
for synthetic biology, as antibiotic 
resistance becomes more prevalent and 
broad-spectrum antibiotics lose favor. 
Microbes have also been engineered to 
invade cancer cells [166] using a synthetic 
gene circuit that triggers invasion only 
when the cells reach a critical density 
(detected via QS), or when they sense the 
hypoxic environment that is a hallmark of 
cancer cells with hyperactive metabolism. 
Furthermore, microbial-based therapies 
are an increasingly promising area of 
research as more information is gleaned 
about the microbiome and the role that 
symbiotic microorganisms play in human 
physiology. 

Other studies have focused on devel- 
oping mammalian cell-based therapeu- 
tics. Engineered cells may be enclosed in 
a semipermeable capsule, which allows 
them to be implanted in the body and to 
interface with human physiology, while 
at the same time being isolated from di- 
rect contact with the patient’s tissues to 
prevent an immune response or metas- 
tasis. This microencapsulation technique 
was used in a system for regulating urate 
homeostasis to combat tumor lysis syn- 
drome and gout, two diseases caused by 
abnormally high urate levels [24]. A syn- 
thetic gene circuit in the mammalian cells 
within the microcapsule utilizes a bacte- 
rial TF to sense urate concentrations and 
control the expression of an enzyme that 
degrades urate. This system restored urate 
homeostasis in a mouse model of acute hy- 
peruricemia [24]. A similar study utilized 
a light-inducible gene circuit to control 
the production of glucagon-like peptide 1 
and reduce glycemic excursions in type II 
diabetic mice [160]. 

A T-cell-based system represents an- 
other form of emerging mammalian 
cell-based therapeutic. In this case, the
immune cells are removed from a patient, 
genetically engineered to target pathogens 
or cancer cells, and then transferred back 
into the patient. However, these therapies 
suffer from side effects such as hyperactiv- 
ity and autoimmune off-target attacks, and 
would benefit from a synthetic control of 
proliferation. Two pioneering studies have 
demonstrated control over T-cell prolifer- 
ation with synthetic RNA-mediated [230] 
or signaling protein-mediated [119] regu- 
lation of T-cell replication. 

Synthetic gene circuits can also target 
therapeutics to different cell types via clas- 
sifier circuits. Two examples demonstrate 
this approach to target cancer cells. In the 
first example, Nissim and Bar-Ziv [22] con- 
structed a transcription-based AND gate 
that only expresses its therapeutic payload 
in cells in which both input promoters 
are on. In this design, two input promot- 
ers active in a specific cancer cell type are 
used to drive the expression of two fusion 
proteins, which form a transcription acti- 
vator complex that activates the expression 
of a cytotoxic effector protein. Different 
input promoters may be chosen to target 
different cancer cell types. The use of two 
input promoters rather than one allows 
greater flexibility in the choice of input 
promoters, and produces a sharp activa- 
tion threshold between premalignant and 
cancer cells, with the magnitude of the 
response increasing in more malignant 
cell lines. In order to avoid unwanted ac- 
tivation of the circuit in healthy cells, the 
level of effector gene expression may be 
tuned using point mutations in one of 
the fusion proteins to lower the efficiency 
of the synthetic TF complex formation. 
The digital logic circuit provides a pre- 
cise and efficient way to target specific 
cell types while minimizing off-target ef- 
fects on healthy cells [22]. In a second 


example, Xie et al. [23] built a classifier cir- 
cuit that determines whether the levels of 
six different miRNAs match the reference 
profile of cancer cell miRNA expression, 
and based on this information controls 
expression of the BAX protein, which trig- 
gers apoptosis (see Sect. 5 and Fig. 7). 
Such deliverable cell-based classifiers uti- 
lizing synthetic gene circuits will play an 
important role in targeting therapeutics 
for difficult problems such as cancer or 
gene therapy. 
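
The logic of the two classifier designs described above can be sketched
as follows. This is an abstraction of the decision rule only, not of the
molecular implementation; the marker names, the split into three "high"
and three "low" markers, the threshold, and the levels are hypothetical.

```python
def and_gate(promoter_a_on, promoter_b_on):
    """Payload is expressed only when both input promoters are active."""
    return promoter_a_on and promoter_b_on


def matches_profile(levels, expected_high, expected_low, threshold=1.0):
    """True when every 'high' marker is above and every 'low' marker is
    below the threshold, i.e. the cell matches the reference profile."""
    highs_ok = all(levels[m] > threshold for m in expected_high)
    lows_ok = all(levels[m] < threshold for m in expected_low)
    return highs_ok and lows_ok


# Hypothetical marker names and levels (arbitrary units).
expected_high = ["miR_H1", "miR_H2", "miR_H3"]
expected_low = ["miR_L1", "miR_L2", "miR_L3"]

cancer_like = {"miR_H1": 5.0, "miR_H2": 3.2, "miR_H3": 2.1,
               "miR_L1": 0.1, "miR_L2": 0.2, "miR_L3": 0.05}
healthy_like = {"miR_H1": 5.0, "miR_H2": 0.4, "miR_H3": 2.1,
                "miR_L1": 0.1, "miR_L2": 2.5, "miR_L3": 0.05}

for name, cell in (("cancer-like", cancer_like), ("healthy-like", healthy_like)):
    express = matches_profile(cell, expected_high, expected_low)
    print(name, "-> express effector:", express)
```

Requiring several markers at once is what keeps the effector off in cells
that share only part of the reference profile.
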

In other examples of cancer therapy, 
oncolytic viruses were optimized for 
cancer targeting with improved specificity 
[156, 231]. Recently, synthetic constructs
that place diphtheria-toxin gene expression
under the control of the H19 promoter
were tested in human patients [232].

Synthetic biology may also be applied 
to combating antibiotic-resistant bacteria, 
which present an emergent health threat 
worldwide [233]. The shortage of effective 
new antibiotics [234] necessitates the 
development of novel therapies. One 
possibility is to use synthetic biology to 
engineer phages (viruses that naturally in- 
fect bacteria) to combat antibiotic-resistant 
bacterial strains [235]. The engineered 
phages may be used to target biofilms, 
surface-associated bacterial communities 
encased in an extracellular matrix of 
polysaccharides and proteins that protects 
the bacteria from antibiotics and from 
the patient's immune system [236]. For 
example, T7 bacteriophage engineered 
to express Dispersin B, an enzyme that 
hydrolyzes a key biofilm component, 
efficiently disrupts E. coli biofilms [237]. 

Phages may also supplement antibiotic 
therapy. In one study, M13 phages were 
engineered to overexpress genes predicted 
to make host cells more vulnerable to 
antibiotics; these genes included lexA3, 
which inhibits the bacterial DNA damage 


response, leaving the cell vulnerable 
to DNA damage-inducing antibiotics; 
csrA, which inhibits biofilm formation; 
and ompF, which encodes a membrane 
protein through which antibiotics can 
enter the cell [238]. A combination of 
each engineered phage and antibiotic kills 
E. coli significantly more efficiently than 
either a combination of antibiotic and 
control (unmodified) phage or antibiotic 
alone [238]. Notably, M13 phage does not 
kill the host cell in absence of antibiotics, 
and hence it is less likely to give rise to 
bacterial resistance, which presents an im- 
portant problem in antibacterial therapy 
[235]. In addition to resistance, many other 
challenges remain on the road to effective 
phage-based therapy, including potential 
side effects, the need for the phage to 
evade the patient’s immune system, 
and a limited phage host range [235]. 
Nonetheless, the studies described above 
suggest that phage-based therapy is a 
promising strategy to pursue in combating 
antibiotic-resistant bacteria [235]. 

Finally, synthetic circuits can be utilized 
for the discovery and optimization 
of novel pharmaceuticals, such as 
new antituberculosis compounds [239] 
and novel classes of treatments for 
antibiotic-resistant bacteria, such as lysins
(phage-derived proteins that lyse bacteria) 
or bacteriocins (small peptides that kill 
bacteria by forming pores in their cell 
membranes) [233]. Synthetic biology also 
holds the potential for improving the yield 
of biopharmaceuticals and other valuable 
chemicals, as described below [233]. 


9.3
Synthetic Biology in Manufacturing 


Synthetic biology has revolutionized in- 
dustrial biotechnology by allowing engi- 
neers to optimize living cells rationally 
for the production of pharmaceuticals and
other valuable chemicals [240]. Central to 
these efforts is repurposing biosynthetic 
genes from various organisms and tuning 
their level of gene expression. Often, gene 
expression is engineered to be static, so 
that it is constant throughout the course 
of production, or it is controlled with a 
simple gene circuit, in which an externally 
added small molecule induces a consti- 
tutively expressed TF to switch on the 
biosynthetic genes. However, these sys- 
tems suffer from the metabolic burden 
that they exert on cells, slowing cellular 
growth and thus impairing productivity. 
Complex biosynthetic systems would ben- 
efit from the dynamic regulation enabled 
by synthetic gene circuits, wherein the 
expression of biosynthetic genes is ad- 
justed based on a cell’s physiological state, 
allowing adjustments based on the con- 
centration of pathway intermediates or 
environmental conditions in the bioreac- 
tor such as nutrient availability, oxygen 
level, temperature, and cell-density [241]. 
Such dynamic regulation is akin to how 
cells naturally adjust their own physiol- 
ogy, and is expected to improve culture 
growth rates. 

In an early example of dynamic reg- 
ulation, the yield of lycopene in E. coli 
was improved by placing the expression of 
rate-limiting lycopene biosynthetic genes 
under the control of a TF that sensed 
excess glucose concentrations [242]. As 
a result, the pathway was only turned 
on when the cell had enough energy 
to continue growing, and this led to an 
18-fold higher lycopene production. More 
recently, Zhang et al. engineered circuits 
to control biosynthetic pathways using an 
approach called “dynamic sensor-regulator
system” (DSRS) [243]. These circuits are
based on TFs that bind pathway interme- 
diates, and affect synthetic promoters so 
that the expression of biosynthetic genes
is activated only when needed, at the level 
needed. Such an approach to biosynthetic 
gene regulation is theoretically applica- 
ble to many different pathways, given the 
number of known metabolite-binding TFs, 
and could be further expanded with the 
use of synthetic aptamers evolved to bind 
different molecules of interest (see Sect. 
2.5). An alternative method of improving 
the yield would be to tie the regulation of 
biosynthetic genes to the density of cells 
in the bioreactor in order to maximize 
cell growth before activating biosynthesis. 
This approach has been implemented with 
a QS circuit as an input to a toggle switch, 
and it has improved the yield of both re- 
combinant proteins [139] and metabolic 
molecules [244, 245]. 
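
The benefit of delaying biosynthesis until the culture is dense can be
illustrated with a toy model. This is a minimal sketch under stated
assumptions: logistic growth, a fixed fractional growth penalty
("burden") while the pathway is induced, and product formation
proportional to biomass. All parameter values and names are hypothetical
and are not drawn from the cited studies.

```python
def simulate(induce_at_density, hours=24.0, dt=0.01,
             x0=0.01, capacity=1.0, growth_rate=1.0,
             burden=0.9, production_rate=1.0):
    """Euler simulation of biomass x and product p.

    Induction starts once x >= induce_at_density and then stays on,
    mimicking a quorum-sensing input that flips a toggle switch.
    Induced cells grow more slowly (metabolic burden) but make product.
    """
    x, p, induced = x0, 0.0, False
    for _ in range(int(round(hours / dt))):
        induced = induced or x >= induce_at_density
        mu = growth_rate * (1.0 - burden) if induced else growth_rate
        dx = mu * x * (1.0 - x / capacity)
        dp = production_rate * x if induced else 0.0
        x += dx * dt
        p += dp * dt
    return x, p


_, static_yield = simulate(induce_at_density=0.0)    # induced from the start
_, dynamic_yield = simulate(induce_at_density=0.5)   # induced at half capacity
print("static:", static_yield, "dynamic:", dynamic_yield)
```

With these assumed parameters the density-triggered culture accumulates
far more product, because it reaches a high biomass before paying the
growth penalty. The size of the advantage depends entirely on the assumed
burden and time horizon.
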


10 
Conclusions 


The past two decades of synthetic biology 
have led to a tremendous explosion in the 
design of ever more powerful and complex 
synthetic gene circuits. These circuits 
confer a greater degree of control over 
engineered biological systems than was 
ever possible before. However, significant 
challenges remain in the design and 
application of synthetic gene circuits due 
to incomplete biological knowledge, slow 
design-build-test cycles, nonpredictive
in-silico models, challenges in designing 
circuits that can function reliably in 
different contexts from the laboratory 
environment in which they were 
engineered [246], and outstanding issues 
concerning orthogonality, modularity 
and noise (for reviews, see Refs [202] 
and [203]). Yet, synthetic gene circuits
clearly have much to offer society, and
substantial advances in fundamental circuit
design and in the implementation of circuits
in biosensing, medicine, biomanufacturing
and other application areas are expected
to continue during the next decade.


Note Added in Proof 


While this chapter was under review, 
additional articles on synthetic gene cir- 
cuits have been published. Chen et al. [247] 
programmed E. coli for inducible produc- 
tion of amyloid protein-based extracellular 
fibrils carrying affinity tags for inorganic 
nanoparticles. As proof of principle, the 
authors demonstrated inducible forma- 
tion of a conductive biofilm composed of 
proteins decorated with gold nanoparti- 
cles. The system shows potential future 
use of synthetic biology for the formation 
of materials that combine the properties 
of living systems and inorganic matter to 
achieve novel functions [247]. Another re- 
cent study combined CRISPR gRNA and 
RNAi for tunable, multiplexed regulation 
of transcription in mammalian cells [248]. 
Notably, the study was the first to achieve 
inducible synthesis of gRNAs, opening 
up novel possibilities for inducible regula- 
tion of target genes by CRISPR [248]. In 
addition, a detailed protocol has been pub- 
lished for combining recombinase-based 
digital logic and memory in living cells 
[249], and two new reviews discuss features 
and potential applications of digital vs. 
analog synthetic gene circuits [250, 251]. 
Keywords


DNA 
Deoxyribonucleic acid. 


Origami 
The traditional Japanese art of paper folding. 


Nanorobotics
The field of study concerned with designing, building and programming robots at the
nanoscale (1 nanometer = 10⁻⁹ meter).


Nanobiotechnology
The study of biological phenomena and entities at the nanometer scale.


Nanomechanical engineering
The design of machines at the nanometer scale.


Synthetic biology
The synthesis of new biological or genetic constructs not found in Nature, using natural
biological or genetic parts or design principles.


Robots augment the ability to automate the perception and control of reality. During
the past century, robotics has become a multidisciplinary field, where insights
from physics, biology, engineering and design have been integrated to drive an
evolution of robotic devices for diverse tasks, from industry and mass production
to scientific research and space exploration. Recent advances in the design and
fabrication of nanoscale machines have enabled the introduction of man-made
robots into the biology of living organisms, a realm that has so far remained
essentially inaccessible to them. In this chapter this relatively new field will be
reviewed, starting with the roots that led to its existence and its enablement. How
robotics can revolutionize medicine and therapeutics, as they are currently known,
will also be highlighted.


1 
Introduction to Robotics 


Difficulties in actually defining a robot 
are rather surprising considering how 
essential robots have become in terms 


of humankind’s technology, society and 
culture during the past century. Robots 
can be defined in the broadest sense as au- 
tomata linking the perception and process- 
ing of environmental information to the 
performance of defined tasks. However, 


this definition seems to encapsulate many 
different entities, from living organisms 
and single cells to machines and virtual 
agents living in cyberspace, to the stage 
where this definition itself is pointless. It is
therefore perhaps better to define robotics 
in terms of its components, development, 
and implementations. 


1.1
A Brief History of Robotics 


In contrast to robots, the more general 
term machine can refer to a device that 
simplifies task performance or problem 
solving (for the purposes of this dis- 
cussion the thermodynamic definition of 
work, which is done by engines, can be ne- 
glected). As such, machines were probably 
used as early as 2.6 million years ago by 
hominines, in the form of simple stone 
tools. These later evolved into models de- 
signed for specific tasks (cutting, piercing, 
sewing, etc.) made from various materials. 
The discovery, later in documented his- 
tory, of basic machines such as the wheel, 
the lever, and the inclined plane, spawned 
new generations of machines for more 
elaborate and demanding tasks, enabling 
human technology, economy, and society 
to be scaled up. 

However, these machines were still just 
brainless tools that required a human user 
for their operation. The knowledge that 
paved the way to robotics in its mod- 
ern sense appeared during the Renais- 
sance, and involved various branches of 
mechanical kinematics (the study of mo- 
tion), applied physics, and engineering. 
Chronometry (the measurement of time) 
enabled the generation of automata that 
could perform series of defined tasks along 
a timeline, and simple energy sources such 
as a spring could ensure their sustained 
activity. In parallel, inventions such as the 


DNA Origami Nanorobots 


thermometer and barometer (seventeenth 
century), as well as advances in analyt- 
ical chemistry, enabled the introduction 
of sensing of environmental information 
into machines. The industrial age (late 
nineteenth century) already saw elabo- 
rate machinery, powered by high-energy 
sources (steam, fuel, electricity), which 
could perform faster, stronger, and more 
accurately than any human. Some of these 
machines were operated by instructions 
encoded on hardware such as punched 
cards (e.g., the Jacquard loom, 1801). 

Of particular interest is the emergence 
of the special category of machines known 
today as computers, as machines capable 
of carrying out a diverse set of tasks 
defined by a program specifying the map- 
ping of input to an output. The most 
well-known example of an early com- 
puter was the difference engine, designed 
by Charles Babbage, which was an as- 
toundingly ingenious mechanical device 
for automating polynomial calculations. 
Roughly a century later, Alan Turing
outlined the first modern computer,
the universal Turing machine, which is
still the most powerful model of computation
known to exist. By the mid-twentieth
century, Turing, von Neumann, and others
had already designed advanced electronic
computer architectures, setting the
stage for the future fields of programming 
languages, algorithms, and artificial intel- 
ligence. 

What allows the integration of all the
separate components seen so far (basic
mechanical devices, sensors, a computer,
a program, a clock, and so on) is the
framework of cybernetics, defined by
Norbert Wiener as “... the study of control
and communication,” in machines, be
they natural or artificial. Cybernetics is the
wiring of machine components in a way
that achieves control mechanisms such as
feedback and timing, in order to optimize
the function of the machine.

It can now be seen that robots may
be defined based on these components,
as devices linking sensing to actuation
by information processing, to perform
tasks. Later, it will become clear how
this definition is suitable to define robots
operating at the nanoscale with a biological
system as their nonlinear, highly dynamic
environment (Fig. 1).


Fig. 1 Schematic abstraction of a robot. The
robot is a machine linking the sensing of
environmental information to an actuation of
some kind, via processing or computation.
The framework of cybernetics defines the ways
in which the various components connect with
each other to achieve control and optimization
(feedback, timing, etc.). The effects resulting
from the robot's activity can be measured
again as environmental information. User
instructions can be integrated into the processor
(middle) or at any peripheral location
(sensors, clock, etc.)


1.2
Robotic Control of Molecules


Robots augment the ability to automate the
perception and control of reality. Modern
industry, scientific research, manufacturing
and more all require the use
of robots to cut costs, improve efficiency,
and optimize performance. However, one
realm is still largely inaccessible to robots,
namely the biology of living organisms.
Biological systems consist of small components
such as molecules, molecular assemblies
and single cells, which interact
on size and time scales that have not,
until very recently, been accessible to
exploration. Moreover, biological interaction
networks are among the most complex
structures studied in Nature. Hence, to
control the location and timing of the activity
of molecules is a central requirement
for biology.

Every cellular process, such as growth, 
metabolism and signaling, is the result 
of molecular interactions — whether be- 
tween enzymes and substrates or between 


receptors and ligands — that are exquisitely 
regulated in time and space. In contrast, 
the ability to mimic this level of arbi- 
trary control at the molecular scale is 
exceedingly limited. This is particularly 
significant for the field of pharmacol- 
ogy and drug design. The ideal drug 
will act only at its designated target with 
the correct timing. Unfortunately, how- 
ever, most drugs currently in use have 
an enormous range of adverse effects due 
to a lack of correct spatiotemporal con- 
trol [1-4]. Indeed, whilst the currently 
available therapeutic arsenal includes very 
effective molecules, a lack of ability to cor- 
rectly control them hinders their safe and 
effective use. 

During the past four decades, remark- 
able progress has been made in the 
area of drug control. Currently, an im- 
proved spatial control of therapeutics is 
achieved mainly by conjugating a thera- 
peutic molecule to a target-specific carrier 
such as an antibody or an interleukin 
[5]. An alternative approach may involve 
the local implantation of a therapeutic 
molecule, but this is not always applica- 
ble. The temporal control of drug therapy 
is mostly achieved by embedding the drug 
in a matrix that hinders and/or prolongs 
its dissolution and subsequent diffusion, 
thereby causing a sustained and more sta- 
ble distribution of the drug [6]. Various 
elaborations of this approach have been 
proposed, including matrices comprised 
of multiple phases, which can release 
the drug in stepwise fashion [7]. Other 
matrices can be triggered externally by 
physical means, such as infrared laser or 
ultrasound [8], although even these modifi- 
cations will not always improve the drug’s 
performance [9]. 

While clearly representing improve- 
ments over conventional, unmodified drug 
administration (which currently accounts 
for most therapeutics used), many desired
features are not enabled by these tech- 
niques. Examples of such features include: 
(i) reversing the availability of a molecule 
at will, for example by sequestering or 
shielding it reversibly; (ii) coordinating 
the activity of two or more molecules by 
turning one molecule off as the other 
is turned on, and vice versa, such that 
they do not collide or compete with each 
other; or (iii) activating a molecule by a 
complex combination of biological condi- 
tions, for example, a target cell expressing 
molecules A, B, and C but not D and not 
E. Cells routinely exhibit such features, 
and the application of an equivalent level 
of expertise in therapeutics would lead to 
a paradigm shift in how drugs are used 
and designed. Moreover, it would directly 
lead to significant improvements in the 
safety and efficacy of many drugs in use 
today. 
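
Feature (iii) above is, in essence, a Boolean condition over molecular
markers. A minimal sketch of that condition follows, using the marker
labels A to E from the text; the function name and the example cells are
illustrative only.

```python
def should_activate(markers):
    """Condition (iii): A, B and C present, D and E absent."""
    required = {"A", "B", "C"}
    forbidden = {"D", "E"}
    return required <= markers and not (forbidden & markers)


# Hypothetical cells, each described by the set of markers it expresses.
cells = {
    "target": {"A", "B", "C"},
    "bystander_missing_C": {"A", "B"},
    "bystander_with_D": {"A", "B", "C", "D"},
}

for name, markers in cells.items():
    print(name, should_activate(markers))
```

Only the cell that expresses all required markers and none of the
forbidden ones satisfies the condition.
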

Cells can be thought of as nanoscale 
computers, which generate precise out- 
puts based on molecular inputs [10]. This 
analogy holds a potential key to solving 
the challenge just described. Unlike the 
fields of pharmacology and drug design, 
computers have progressed meteorically 
since the mid-twentieth century, a trend 
described by Moore’s law. In fact, 
computers readily enable high levels of ar- 
bitrary control over processes, much like 
cells do at the molecular scale. A computer 
capable of reading biomolecular inputs, 
and linking its output to drug release, 
could arguably solve the challenge of drug 
control. Until now, however, it has not 
been possible to implement computing 
capabilities at the nanoscale, due mainly 
to the uncertainty of how such computers 
should be designed and programmed in 
the first place. Moreover, it has not been 
clear how computers could interface with 
living cells. 
2 
DNA Nanotechnology 


2.1 
Computing 


During the past two decades it has become 
clear that computers could be built from 
molecules, of which DNA would be a natu- 
ral substrate for universal computing. The 
tractability of Watson-Crick pairing enables 
the programming of self-assembling 
DNA interactions for a diverse set of 
computing paradigms such as tiling [11], 
neural networks [12], and cellular au- 
tomata [13]. In 1994, Adleman demon- 
strated a solution to the Hamiltonian path 
problem, a nondeterministic polynomial 
time (NP)-complete problem, using DNA 
strands that represented the vertices and 
paths of a seven-node graph [14]. The 
strand set representing the specific graph 
in question was mixed in a tube and al- 
lowed to self-assemble into a longer strand 
that would encode the solution, effectively 
emulating massively parallel computing 
as the strands were allowed to generate 
every possible solution. DNA computing 
has since been applied to a variety of 
computational problems [11, 12, 14-22], 
although its scalability has been ques- 
tioned [23], as the complexity of solvable 
problems is limited by the amount of 
DNA required to carry out the computation 
process. 
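
The generate-and-filter logic of this experiment can be illustrated in software. The following is a minimal sketch and not Adleman's protocol: the four-vertex graph and the function name are made up for illustration, and the exhaustive enumeration of vertex orderings merely stands in for the self-assembly of strands in the tube.

```python
from itertools import permutations

def hamiltonian_paths(vertices, edges, start, end):
    """Return every ordering of the vertices that is a directed Hamiltonian path."""
    edge_set = set(edges)
    found = []
    # "Generate": enumerate every candidate ordering between start and end.
    for middle in permutations([v for v in vertices if v not in (start, end)]):
        path = (start, *middle, end)
        # "Filter": keep only candidates in which each step follows a real edge.
        if all((a, b) in edge_set for a, b in zip(path, path[1:])):
            found.append(path)
    return found

# Hypothetical directed graph (not the seven-node graph of the experiment).
vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3)]
print(hamiltonian_paths(vertices, edges, start=0, end=3))
```

The number of candidate orderings grows factorially with the number of vertices, which gives a feel for why the amount of DNA needed becomes the limiting factor for larger problems.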

The seamless integration of molecu- 
lar computing into a biological system 
was first demonstrated by Benenson, who 
described DNA computers that could rec- 
ognize gene expression patterns in tar- 
get cells and activate a desired genetic 
response in a logical fashion [24, 25]. 
However, responding exclusively to nu- 
cleic acids as inputs ostensibly limits 
the types of cues sensed by the system; 
moreover, such a computer would have 
to be delivered into cells or genetically 
encoded in order to function. Although 
nucleic acids are natural inputs for a 
DNA-based system, the introduction of 
aptamers (short nucleic acid sequences se- 
lected to recognize specific epitopes [26]) 
made possible the sensing of virtually any 
type of biological molecule, outside the 
cell, by nucleic acid-based sensors. 

On the other hand, DNA-DNA inter- 
actions have provided a robust platform 
for molecular computing and logic. Two 
DNA strands competing for the same 
complementary strand can produce pro- 
grammable kinetics, which led to the 
development of systems based on strand 
displacement reactions [27-29], such as 
a polymerase chain reaction (PCR) per- 
formed without temperature changes [30]. 
Strand displacement reactions were re- 
cently shown to be scalable, and have been 
utilized in the construction of neural net- 
works capable of complex computations 
[12, 31, 32]. DNA strand displacement 
kinetics has also enabled the design of 
strands capable of “walking” along a DNA 
track made from complementary strands, 
with some of these actually referred to as 
“robots” [33-37]. Another implementation 
was in the construction of self-replicating 
devices [38] and a bacteria-inspired motor 
[39]. 


2.2 
Nanofabrication 


The tractability of Watson-Crick pairing 
enables the use of DNA as a programmable 
building block for molecular 
self-assembly applications. In a seminal 
report in 1982, Seeman first proposed 
the concept of fabricating two-dimensional 
(2D) and three-dimensional (3D) lattices 
from DNA [40], which could enable the 
parallel construction of nanoscale objects 
with nanometer-scale features and accu- 
racy [17]. This technology was further elab- 
orated by the introduction of DNA origami 
(see below), which enabled the relatively 
simple fabrication of arbitrary 2D and 3D 
shapes from folded DNA. Using DNA in 
such a way enables the integration of func- 
tion into structure. As noted above and as 
will be discussed below, DNA molecules 
can be designed to carry out molecular 
computing [12, 14, 31], selectable molec- 
ular recognition [26], enzyme-free logic 
circuits [27], catalytic activity [41], and 
mechanical motion [42], while still being 
usable as a genetic code. This unique versa- 
tility makes DNA suitable for the design of 
advanced self-assembling nanoscale ma- 
chines. Integrating the components of 
positional control, information encoding, 
and computing in DNA has produced a 
variety of applications, from DNA assem- 
bly lines to controllable nanocontainers 
[36, 43-45]. 

An interesting branch of DNA comput- 
ing lies at the interface between DNA 
nanofabrication and computing by tiling. 
Hao Wang, who investigated this prob- 
lem during the early 1960s, proposed that 
computing by tiling is universal, in the 
sense that any computation can be carried 
out by filling a surface with tiles. Winfree 
and others have used the DNA junction 
principles and basic shapes outlined by 
Seeman (double crossover molecules; DX 
molecules [46-48]) to construct 2D tiles 
capable of self-assembling, or crystallizing, 
on a surface to generate such computa- 
tional patterns and processes [11]. Such 2D 
DNA tiles can serve as information bear- 
ing seeds for self-assembly into defined 
structures or computational outcomes [13, 
49, 50], whereas such algorithmically pat- 
terned surfaces can be used as templates 
to array other materials with nanoscale 
precision [51]. 
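
To make the idea of computing by tiling more concrete, the sketch below grows a pattern row by row, with each new "tile" determined by the two tiles beneath it according to an XOR rule. This is only an abstract software illustration; the rule, the seed row, and the function name are assumptions chosen for simplicity and are not taken from the cited works.

```python
def grow_xor_rows(seed, n_rows):
    """Grow rows in which each cell is the XOR of its two neighbors in the row below."""
    rows = [list(seed)]
    for _ in range(n_rows - 1):
        prev = rows[-1]
        # Pad the previous row with 0 on both sides, so each new row is one cell wider.
        padded = [0] + prev + [0]
        rows.append([padded[i] ^ padded[i + 1] for i in range(len(padded) - 1)])
    return rows

# A single "1" tile acts as the seed from which the pattern grows.
for row in grow_xor_rows([1], 8):
    print("".join("#" if bit else "." for bit in row))
```

Starting from a single seed, the local matching rule alone produces a nested triangular pattern, which is the sense in which a seed plus local tile rules can encode a global computational outcome.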

By using the same principles of DNA 
junction design and construction [52, 53], 
larger repetitive structures in the 2D and 
3D realms were later created, such as 
surfaces made by three-, four-, five-, and 
six-arm junctions of DNA [54] (according 
to the principles laid out by Seeman, eight 
arms is the maximum rank of a junction 
using just the four canonical bases A, G, C, 
and T [18]). Introducing geometric twists 
into the junction enables multiple junc- 
tions to self-assemble into space-filling 
polyhedra with programmable properties 
[55, 56]. 

Interestingly, Aldaye and colleagues 
have recently shown that DNA nanostruc- 
tures can be genetically encoded to array 
intracellular proteins in a certain structure 
to reprogram biochemical reactions [57]. 


2.3 
Nucleic Acid Enzymes 


The astonishing demonstrations by Alt- 
man and Cech that RNA molecules can 
have a catalytic function [58, 59], and their 
discoveries that synthetic nucleic acids can 
be selected and evolved in vitro to ex- 
hibit such functions [60-63], unleashed a 
new and very powerful technology. A wide 
range of enzymatic activities has already 
been selected from nucleic acid pools, and 
it is conceivable that in the near future, catalytic 
nucleic acids (“ribozymes”) could be 
designed and synthesized de novo without 
selection. 

DNA is less common than RNA among 
catalytic nucleic acids, since RNA is more 
structurally and functionally diverse than 
DNA; however, the obvious advantages of 
DNA, namely its chemical stability and 
lower synthesis cost, could lead to improved 
ways of making diverse functions from 
DNA. Still, RNA catalysts can be integrated 
into a DNA machine using standard 
techniques of DNA nanofabrication [64]. 


2.4 
Nucleic Acid Sensors 


In 1990, two laboratories independently 
discovered that short, single-stranded 
nucleic acids (aptamers) could be 
selected from a random pool of 
sequences to bind to a molecular 
target of choice [26, 65]. This 
binding is mediated by the aptamer as- 
suming a certain 3D structure, enabling it 
to access its target and hydrogen bond with 
it. This technology garnered high interest 
as a potential strategy to block undesired 
functions or proteins in vivo (AIDS, 
cancer growth factors, etc.), and several 
aptamers are currently in advanced stages 
of clinical trials as therapeutic agents. 

Aptamers can be used for the detection 
of very low amounts of a chosen analyte, 
by their integration into a mechanical 
sensor. Such sensors, which are often 
referred to as “beacons,” usually consist 
of a DNA/RNA aptamer bound to a partially 
complementary segment of itself, forming a hairpin, 
while one end of the molecule is labeled 
with a fluorophore and the other with a 
dark quencher. Binding of the analyte by 
the aptamer displaces the complementary 
strand and allows a fluorescent signal to 
be released and detected. This concept of 
an aptamer-based mechanical sensor is 
analogous to the “riboswitch,” a regulatory 
region of RNA molecules, which exerts 
its function by binding to certain target 
molecules in the cell. An interesting 
application based on this concept is that 
of genetically encoded sensors for various 
molecules in the cell. 


Aptamers can also be used as sensors or 
mechanical gates for DNA nanorobotics, 
as will be described below [66]. 


2.5 
DNA Actuators and Motors 


It has been seen already that strand 
displacement reactions can be utilized in 
the design of “walkers,” DNA strands 
that are displaced and reattached to a 
complementary DNA track along a certain 
path, be it linear or circular. Walkers 
have been utilized for carrying payloads 
between points of origin and destination 
to drive chemical syntheses or a more 
elaborate assembly of large payloads [35, 
45]. Additionally, strand displacement was 
used to drive a polymerization motor 
inspired by bacterial locomotion [39]. 

These elegant designs all have a cen- 
tral drawback in an applied context; that 
is, the reactions are not autonomous but 
require the sequential addition of displac- 
ing strands (termed “fuel” [67]) at every 
step, and removal of the double-stranded 
displacement products (termed “waste”). 
Since, without the respective addition or 
removal of fuel and waste, the reaction 
would rapidly be halted, these actuators 
are not suitable for use in therapeutic ap- 
plications unless the fuel strands were to 
be a product of a disease process, and there 
existed an intrinsic mechanism for waste 
removal. 

During the decades following Watson 
and Crick’s discovery of the structure of 
DNA, much information was accumu- 
lated regarding the mechanical properties 
of DNA. Single molecule methods have 
enabled studies of DNA molecules in sin- 
gle strands and in molecular motors [68, 
69]. Properties such as Young’s modulus 
(on the order of 0.3-1.0 GPa, much 
like a stiff plastic), persistence length (ca. 
50 nm), and the ability of DNA to shift 
between alternative structures driven by 
environmental changes, can be integrated 
into the design of nanoscale motors and 
actuators which do not rely on strand 
displacement, in which kinetics is dom- 
inated by sequence. Mao and colleagues 
have demonstrated an actuating device 
based on the transition of DNA from 
the B to Z structure upon addition of 
hexaamminecobalt [70], a reagent favoring 
the left-handed folding of DNA. Several 
examples of mechanical DNA “tweezers,” 
powered by various stimuli and capable 
of holding a molecule and releasing it on 
demand, have also been described [71-73]. 
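
The mechanical figures quoted above can be related to each other with a simple estimate. The sketch below treats double-stranded DNA as a uniform elastic rod, for which the persistence length is P = EI/(k_B T), with I the second moment of area of a solid circular cross-section. The rod radius of 1 nm, the temperature of 298 K, and the function name are assumptions made for this illustration; real DNA is not a homogeneous rod, so the result is an order-of-magnitude check only.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K

def persistence_length_nm(youngs_modulus_pa, radius_m=1.0e-9, temperature_k=298.0):
    """Persistence length (nm) of a uniform elastic rod: P = E*I / (k_B*T)."""
    # Second moment of area for a solid circular cross-section: I = pi * r^4 / 4.
    second_moment = math.pi * radius_m**4 / 4.0
    return youngs_modulus_pa * second_moment / (K_B * temperature_k) * 1e9

# Lower and upper ends of the Young's modulus range quoted in the text.
for modulus_gpa in (0.3, 1.0):
    print(modulus_gpa, "GPa ->", round(persistence_length_nm(modulus_gpa * 1e9)), "nm")
```

Under these assumptions, the lower end of the quoted modulus range gives a persistence length of roughly 57 nm, which is of the same order as the ca. 50 nm cited above.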


2.6 
Miscellaneous Functions of Nucleic Acids 


Based on the above-mentioned reports, it 
might be assumed that it would be pos- 
sible to select a DNA or RNA sequence 
that exhibited any desired property. The 
property space of nucleic acids, while os- 
tensibly large, is still essentially obscure. 
For example, the kinetics and specificity 
of aptamer binding are still not well un- 
derstood, and whether there is a common 
structure or sequence motifs in catalyti- 
cally active sequences is unknown. Never- 
theless, some interesting directions have 
already been highlighted, such as RNA 
mimics of green fluorescent protein (GFP) 
[74]. 


2.7 
DNA Origami 


DNA origami, the folding of a single 
DNA strand to yield a diverse array 
of shapes, was first demonstrated by 
Rothemund [19]. This technique (termed 
“scaffolded” DNA origami to distinguish 
it from other methods of DNA nanofabrication 
by self-assembly) makes use of 
a “scaffold” DNA strand that is typically 
several thousands of bases long. Direct- 
ing the folding of the scaffold strand to 
the desired shape is achieved by hundreds 
of short (typically 15 to 60 bases long) 
oligonucleotides, called “staple” strands, 
which hybridize to distinct regions in the 
scaffold strand and induce crossovers that 
keep the entire structure solid. The fold- 
ing reaction is carried out by mixing the 
scaffold and staple strands and subjecting 
the mixture to a temperature annealing 
ramp, which enables the structure to fold 
properly without being caught in local en- 
ergy minima. Scaffolded DNA origami is 
remarkably robust and reproducible, and 
allows the fabrication of an astonishing va- 
riety of shapes with arbitrary features and 
geometries [75-79]. 

The physics of DNA origami folding 
is still largely unclear in light of the re- 
markable robustness of this system to 
generate very complex structures both 
in 2D and 3D forms. The temperature 
annealing ramp is used as an error cor- 
rection mechanism, enabling the system 
to achieve its designed global minimum 
efficiently; however, the role of mechani- 
cal entropy in DNA origami folding is 
not clear. Scaffolded DNA origami objects 
seem to be very tolerant to mismatches and 
omissions of some strands [19], and can 
fold correctly most of the time, given that 
the object has been properly designed, the 
buffer contains a sufficient magnesium 
concentration, and that the staple strands 
are in sufficient excess over the scaffold 
strand. 

Scaffolded DNA origami has several ad- 
vantages over traditional DNA nanofab- 
rication methods in the context of the 
serial manufacture of DNA devices. First, other 
DNA nanotechnology techniques rely on 
a mixture of short oligonucleotides 
interacting with each other, which makes the 
structure very sensitive to stoichiometry 
and strand-strand competition. The way 
around this is usually to break up the 
synthesis into steps, where the product of 
each step is isolated on a solid phase or by 
some other means. In contrast, scaffolded 
DNA origami is very tolerant to staple 
strand concentration variations, and usu- 
ally folds correctly as long as each staple 
is given a chance to interact at least once 
with the scaffold, making synthesis signif- 
icantly simpler with higher yields. Second, 
it has been shown recently by Shih and colleagues 
that scaffolded DNA origami can 
be folded from a double-stranded scaffold 
[80], making it feasible to encode both scaf- 
fold and staple strands on a large plasmid 
for scaling-up purposes. 

Recently, Dietz and colleagues have 
elucidated folding as a phase transition 
phenomenon [81], where nothing or very 
little happens until a certain, critical 
temperature is reached which depends 
on the folding shape. At the critical 
temperature, during a potentially very 
short period, most of the object folds 
correctly and remains unchanged until 
the end of the process. This discovery 
has significant implications regarding the 
future serial manufacture of DNA origami 
devices for therapeutic applications (this 
will be discussed below). 


2.8 
Computer-Aided Design (CAD) Tools for 
DNA Nanostructures 


Computer-aided design (CAD) tools for 
the automated design and visualization 
of DNA objects have led to the promotion 
and improvement of DNA nanofabrication 
techniques. NUPACK and MFOLD are 
web servers for the structural prediction 
of nucleic acid systems [82, 83]. SEQUIN 
was the first algorithm to enable 
the optimal design of DNA junctions 
[84], and was followed by UNIQUIMER 
3D, for the complete design of latticed 
DNA structures [85]. TIAMAT, SARSE, 
and NANOENGINEER-1 each provide ex- 
cellent user interfaces for the editing of 
DNA structures. Douglas and colleagues 
developed caDNAno, an open source CAD 
tool for the rapid design of scaffolded 
DNA origami structures [86]. caDNAno 
has also been integrated into Autodesk 
Maya, a 3D design and animation soft- 
ware with superb visualization capabili- 
ties. CANDO is a finite element-based ana- 
lyzer for caDNAno-designed DNA origami 
structures, which predicts fluctuations 
and rigidity [87]. International confer- 
ences (such as DNA and FNANO) and 
friendly competitions (http://biomod.net) 
have helped to spread this knowledge 
base and to standardize DNA design and 
nanofabrication. Some of these design 
tools, along with the web sites where they 
are available for download or online work, 
are listed in Table 1. 


2.9 
Designing Scaffolded Origami in caDNAno: 
A Guided Tour 


In this section, the aim is briefly to explain 
the process of designing a DNA origami 
shape in caDNAno, mostly because this is 
a highly versatile tool with a user-friendly 
interface, which was used to design the 
prototypical nanorobot described later in 
the chapter. 

The explanation will be provided in the 
form of a list of instructions. 


Tab. 1 Nucleic acid design tools. 

Tool             Use                                   Web site 
MFOLD            Predict the folding of RNA/DNA        http://mfold.rna.albany.edu/ 
                 into secondary structures 
NUPACK           Predict the folding of DNA/RNA        http://www.nupack.org/ 
                 into secondary structures 
UNIQUIMER-3D     DNA nanostructure design              http://ihome.ust.hk/~keymix/uniquimer3D/index.htm 
TIAMAT           DNA nanostructure design              http://yanlab.asu.edu/Resources.html 
SARSE            DNA nanostructure design              http://cdna.au.dk/software/ 
CADNANO          Scaffolded DNA origami design         http://cadnano.org 
CANDO            Mechanical prediction of DNA          http://cando-dna-origami.org/ 
                 origami object properties 
NANOENGINEER-1   Graphical editor for DNA              http://www.nanoengineer-1.com/content/ 
                 nanostructures 


1. Download and install Autodesk Maya 
(student edition) and the most recent 
version of caDNAno, according to the 
instructions on the caDNAno web site 
(http://cadnano.org). 

2. The design interface has three panels 
(Fig. 2): 
a. Lattice panel on the top left 
b. Editing panel on the bottom left 
c. A 3D graphical visualization panel 
on the right. 

3. Choose, from the top toolbar, the type of 
lattice you wish to design the shape on. 
Two types are available, a honeycomb 
lattice or a square lattice; both are 
valid for the design of origami shapes. 
The honeycomb lattice is inherently 
twist-corrected, while shapes designed 
on the square lattice are inherently 
twisted, although this can be corrected 
during design [88]. 


Fig. 2. The caDNAno design interface. The lattice panel is at 
the top left; the editing panel is at the bottom left; and the 
3D visualization panel is on the right. 


4. On the lattice panel, draw the section 
of the shape. Each circle in the lattice 
represents a DNA double helix, viewed 
from the side. Choose helices by clicking 
on them once. Helices are numbered 
automatically. For convenience 
purposes, it is best to start with a helix 
that is numbered as “0” and not “1.” 
If the first helix is numbered “1,” simply 
undo and choose an adjacent helix 
(Fig. 3). 

Drawing the entire section could be 
done by clicking one helix at a time, 
or alternatively by holding the mouse 
button down and moving the cursor 
from helix to helix in a continuous path. 


5. The first click will tag the helix yellow. 
A second click will tag the helix orange 
and open it for editing in the editing 
panel. Note that helix numbering in 
the editing panel is consistent with 
numbering in the lattice panel. 


Fig. 3 Drawing the shape section in the lattice panel. 


Note that while carrying out step (5), 
the scaffold strand path (blue) has been 
outlined automatically by caDNAno. If 
this is satisfactory, continue to step 
(6); otherwise, one undo action will 
erase the automatically chosen path, 
leaving scaffold primers, which can be 
connected manually as desired. 

In addition, the 3D shape will form in 
real-time as you carry out these actions 
on the right panel (Fig. 4). 

6. The editing panel grid shows each 
double helix as two rows of square 
cells. Each row is a strand, with each 
cell representing a base. Note that the 
scaffold strand occupies only one row 
in each double helix, and that the scaffold 
alternates between the parallel and 
antiparallel rows in adjacent helices. 
The grid size is a default value. 
To extend the grid, click the gray 
arrowheads on top of helix 0 in the 
editing panel, and insert the desired 
extension (in multiples of 21 bases, 21 
being two complete turns of the DNA 
double helix). 

7. The scaffold strand path can be extended 
by selecting parts of it and 
dragging them in the desired direction. 
The features to be selected can be chosen 
in the toolbox under “selectables,” 
allowing the selection of only scaffold, 
staples, crossovers, or strands. This is 
particularly useful when the shape is 
very dense and several features are 
overlaid on each other. 

8. To manually introduce a crossover between 
strands in adjacent helices, look 
for the locations marked as allowing a 
crossover to take place. In these locations, 
bases in the two adjacent helices 
directly face each other, making it possible 
for a strand to cross from helix to 
helix without deformation. The allowed 
locations are marked with little bridge 
icons (Fig. 5). Simply click an icon to 
form a crossover. To delete a crossover, 
select the crossover and delete. 


9. Once the scaffold strand path has been 
drawn, click “Autostaple.” This orders 
caDNAno to staple the entire shape 
automatically. However, some of the 
resulting staples will require further 
editing; these staples will appear as 
thick lines, in contrast to valid staples, 
which will appear as thin lines. 


10. The staples can be edited manually, by 
extending them (if they are too short), 
deleting them (if they are unnecessary), 
or introducing breaks (if they are too 
long or circular). The latter is carried 
out by choosing the “Break” tool in the 
side toolbar, and clicking the point at 
which the break is to be inserted. 

Alternatively, staples can be automatically 
broken. For this, click “Autobreak.” 
A dialog box will open asking 
for values for four parameters: 

a. Target length: optimal length for 
staples after breaking (default value 
is 49). 

b. Min length: minimal staple length 
(default value is 15). 

c. Max length: maximal staple length 
(default value is 60). 

d. Min distance to crossover: the minimal 
distance a staple will travel 
before crossing over to an adjacent 
helix (default value is 3). 

For now, leave the default values unchanged. 
With practice and experience, 
one can get a better sense of how these 
parameters translate into the design. 


11. After the shape has been stapled and 
edited (Fig. 6), all that is left is to 
assign a scaffold strand sequence to 
it. For this, click the “Seq” tool in 
the side toolbar, and click at a point 
within the scaffold strand where you 
wish sequence assignment to begin. 
For practice purposes, this can be any 
point along the scaffold strand. 

12. In the dialog box, choose the correct 
scaffold DNA from the list. Alternatively, 
click “custom” and paste 
a sequence of choice. Note, however, 
that the sequence chosen must not be 
shorter than required for completely 
filling the designed scaffold strand 
path. You will be notified in such a case 
to choose a different scaffold sequence. 

13. After choosing the scaffold sequence, the 
sequence itself will appear in the editing 
panel. Make sure the entire shape 
is assigned with sequence. Regions 
of scaffold strand that are not assigned 
indicate the shape is inconsistent, 
for example, that parts of the 
scaffold strand form “islands.” Go 
over the entire design again to eliminate 
such incidents, and assign sequence 
again to ensure the problem is 
solved. 


Fig. 4 Editing the scaffold strand path in the editing panel. Note that helices in the lattice panel are tagged orange, and the shape is 
being formed in real time on the 3D visualization panel on the right. 


Fig. 5 Introducing crossovers between helices. The bridge 
icons mark the locations where crossovers are technically 
allowed. Numbers above and below the icons denote the 
number of the helix to which the strand will cross over. 

Fig. 6 The shape after Autostaple and Autobreak have been carried out. Note that staple colors in the editing panels are consistent 
with staple colors in the 3D panel on the right, making it easier to locate them if needed. 


14. From the top toolbar, click “Export,” 
which will save the staple strand 
sequences in .CSV (comma-separated 
values) format. This can be later opened 
using a spreadsheet editor. 

15. In the spreadsheet editor, some sequences 
may contain unassigned positions, where 
the staple is not hybridized with any 
scaffold. This is fine, and sometimes 
designs are based on leaving ssDNA 
tails of staples extending from the 
shape. These sequences need to be 
assigned manually. 
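
The exported staple list can also be checked with a short script instead of a spreadsheet editor. The following is a minimal sketch under stated assumptions: the column name "Sequence", the use of "?" to mark unassigned bases, and the sample rows are hypothetical and should be adjusted to the actual file; the length limits are the Autobreak defaults quoted in step 10.

```python
import csv
import io

MIN_LEN, MAX_LEN = 15, 60  # Autobreak default limits quoted in step 10

def check_staples(csv_file, seq_column="Sequence", unassigned="?"):
    """Return (row_number, sequence, problem) for every staple worth a second look."""
    issues = []
    for row_number, row in enumerate(csv.DictReader(csv_file), start=1):
        seq = row[seq_column].strip()
        if not MIN_LEN <= len(seq) <= MAX_LEN:
            issues.append((row_number, seq, "length outside 15-60"))
        if unassigned in seq:
            # Assumed marker for bases that were not given a sequence.
            issues.append((row_number, seq, "contains unassigned bases"))
    return issues

# Made-up export with three staples, for illustration only.
sample = io.StringIO(
    "Start,End,Sequence\n"
    "0[20],1[20],ACGTACGTACGTACGTACGTACGT\n"
    "1[41],2[41],ACGTACGT\n"
    "2[62],3[62],ACGTACGTACGT????ACGTACGT\n"
)
for issue in check_staples(sample):
    print(issue)
```

For a real export, the same function can be called on a file opened with `open(path, newline="")` in place of the in-memory sample.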


3 
DNA Nanorobotics 


In this section, the ways in which different 
components can be connected to construct 
a DNA nanorobot will be discussed, along 
with how best to utilize such a 
nanorobot for therapeutic applications. 


3.1 
Case Study: A DNA Nanorobot 


Recently, DNA origami was used to design 
and fabricate an autonomous, logic-guided 
DNA origami nanorobot, which can be 
programmed to transport molecular pay- 
loads between selected points of origin and 
target [66]. 

The nanorobot resembles a hexagonal 
clamshell open at both ends, with two 
sides of the clamshell revolving around 
two single-stranded DNA (ssDNA) axes 
(Fig. 7). On the opposite side, a gate 
consisting of two double-stranded DNA 
(dsDNA) arms controls the nanorobot 
state. When the arms are in dsDNA configuration, 
the two halves of the clamshell are 
held locked; however, when these duplexes 
unzip and open, the nanorobot is free 
to entropically open, exposing its internal 
side. This side can be loaded with a variety 
of cargoes including small molecules, 
drugs, proteins, and small (<30-35 nm) 
nanoparticles. The number, stoichiometry, 
position, and order of payloads can be 
carefully planned. 


Fig. 7. A molecular model of the prototypical 
DNA nanorobot. The nanorobot resembles a 
hexagonal clamshell open at both ends, with 
two sides of the clamshell revolving around 
two ssDNA axes (not seen here). On the opposite 
side (shown here as the front side), 
a gate consisting of two dsDNA arms controls 
the nanorobot state: when the arms 
are in dsDNA configuration, the two halves 
of the clamshell are held locked. However, 
when these duplexes unzip and open, the 
nanorobot is free to entropically open, exposing 
its internal side. This side can be 
loaded with a variety of cargoes including 
small molecules, drugs, proteins (pink), and 
small (<30-35 nm) nanoparticles (yellow). 
The number, stoichiometry, position and order 
of the payloads can be carefully planned. 


Each DNA arm in the gate is made 
from a DNA or RNA aptamer, which 
is designed to sense an input molecule 
of choice, hybridized to a partially com- 
plementary strand. In the presence of 
the input molecule, the aptamer system 
switches to an equilibrium between the 
aptamer-complementary strand complex 
and the aptamer-molecule complex. The 
rate at which this occurs is difficult to 
model, because the system is governed by 
the thermodynamics of the DNA duplex 
on the one hand and by the mechan- 
ics of the nanorobot structure on the 
other hand. The hybridization strength be- 
tween the aptamer and its complementary 
strand governs a trade-off between speci- 
ficity and sensitivity; fewer mismatches 
would lead to a nanorobot that could 
take a long time (on the order of many 
hours) and high concentrations of the in- 
put to open, but is very specific to the 
chosen input. Alternatively, more mis- 
matches would make a more sensitive 
and open-prone nanorobot, which might 
have a higher chance of a false-positive 
activation. 

Programming of the nanorobot is 
achieved by choosing the appropriate ap- 
tamer components of the gate from the 
aptamer pool, and assembling them on 
the nanorobot chassis, in turn setting 
the nanorobot to be activated when it 
encounters the predefined signature of 
molecules and biological conditions recog- 
nized as correct input by the gate. On acti- 
vation, the nanorobot undergoes a drastic 
conformational change and exposes its 
cargo, which was previously sequestered, 
enabling it to interface with the point of 
destination, for example, a tumor cell or 
a target tissue. The nanorobot can also re- 
vert to its inactive state in the absence of 
the required inputs, making its cargo con- 
cealed again, as demonstrated by dynamic 
light-scattering experiments. 

In various demonstrations, nanorobots 
loaded with cancer therapeutics were ca- 
pable of targeting specific tumor cells 
extremely selectively, and inducing growth 
arrest and apoptosis. The nanorobots were 
also capable of scavenging a bacterial 
protein from dilute solutions and carrying 
it to T cells, inducing their 
costimulation specifically for bacteria and 
mimicking a primitive mode of antigen 
presentation. 

An important predecessor to this 
nanorobot, and a landmark in the field 
of DNA nanorobotics, was the DNA 
origami cube described by Andersen and 
colleagues [44]. 


3.2 
Sensing the Environment 


The nanorobot gate in the above- 
mentioned demonstration consisted of 
two aptamers — one for each input — such 
that the nanorobot requires both input 
molecules to open. These input molecules 
are not necessarily proteins. Some types 
of input that a DNA or RNA sequence can 
sense are as follows: 


1. Proteins or small molecules (e.g., 
cancer markers, immune mediators, 
neurotransmitters, hormones, ATP, 
bacterial lipopolysaccharides) - by 
aptamers. 

2. Bacteria — by a DNA sequence contain- 
ing a restriction site for one of the 
enzymes produced by the bacterium 
of choice. It is important to note, how- 
ever, that such a design suffers from the 
obvious drawback of being a one-time 
mechanism. 

3. Nucleic acid sequences (e.g., from 
viruses) — by designing a gate in which 
one of the arms is displaced by the 
target sequence. 

4. Temperature — by carefully designing a 
gate with the appropriate melting tem- 
perature (although the design would 
have to take into account salt concen- 
trations and other factors that modulate 
the melting temperature). 

5. DNA-binding proteins (e.g., transcrip- 
tion factors) — by a DNA sequence con- 
taining the cognate binding site of the 
protein. 
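
For the temperature-sensing option in the list above, a first rough estimate of a gate duplex's melting temperature can be obtained from a common rule of thumb for short oligonucleotides, often called the Wallace rule: about 2 °C per A/T pair plus 4 °C per G/C pair. The sketch below is only that approximation; it ignores salt concentration, strand concentration, nearest-neighbor effects, and any mechanical strain imposed by the nanorobot, so an actual design would rely on thermodynamic prediction tools such as those listed in Table 1. The function name and the example sequence are hypothetical.

```python
def wallace_tm_celsius(sequence):
    """Rough melting temperature of a short duplex: 2*(A+T) + 4*(G+C) degrees C."""
    seq = sequence.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc != len(seq):
        raise ValueError("sequence may contain only A, C, G, T")
    return 2 * at + 4 * gc

# Hypothetical 12-base gate arm with six A/T and six G/C bases.
print(wallace_tm_celsius("ATGCATGCATGC"))
```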


In addition to these natural signals 
readable from the environment, it might 
be desirable to introduce user-generated 
signals that would enable the nanorobots 
to be operated from outside the body. Three
examples include: 


1. Ultraviolet (UV) light - by DNA gates 
containing UV-cleavable spacers or 
UV-sensitive bases. Most UV wave- 
lengths are too energetic to penetrate 
tissues, and rather are absorbed so as 
to cause tissue damage; in contrast, in- 
frared (IR) light can safely penetrate 
tissues to a certain depth, but is not en- 
ergetic enough to alter DNA. A potential 
solution to this is to use upconverting 
nanoparticles [89], that sum up the en- 
ergy from IR photons to produce UV 
photons. This design, too, would typically
be a one-time mechanism.

2. Magnetic fields - using magnetite
nanoparticles tethered to the gate in 
such a way that in a certain magnetic 
field the particles would repel each 
other, leading to an opening of the 
gate. 

3. Radiofrequency — Hamad-Schifferli 
and colleagues have shown that DNA 
hybridization can be electronically 
remote-controlled by attaching the 
DNA to a metal nanocrystal antenna. 
Placing this construct in an electro- 
magnetic field at a certain frequency 
induced heating of the metal and 
subsequent melting of the DNA duplex 
[90]. 


Of course, natural and user-generated 
signals could be integrated in one 
nanorobot for purposes of better control 
and increased safety. 


3.3 
Information Processing and Logic Types 


A central part of a robot is the processor, 
which processes the inputs received from 
the sensors and generates a logical output. 
In this example, the nanorobot gate which 
serves as the processor is composed of 
two sensors connected to the chassis 
in such a way that sensing two inputs 
emulates a logical AND gate controlling 
the nanorobot state. 

It might be helpful to consider the 
prototypical nanorobot described here as 
analogous to a combination lock, in which 
a series of cams with symbols (usually nu- 
merals) etched onto them can freely rotate 
relative to each other. Thus, the system 
may be in any of p^n states, where p is
the range of symbols (e.g., 0-9) and n is
the number of cams. As only one combination
of symbols will allow the lock
to be opened, the lock functions much
like a mechanical-logical n-input AND 
gate, with inputs selectable from space 
p. In the same way, each of the arms in 
the nanorobot can function as a cam in 
a combination lock, typically with a bi- 
nary input (the molecule can be either 
present or absent). In this demonstration, 
the nanorobot has only two arms, which 
is precisely equivalent to an electronic 
two-input AND gate; however, an addi- 
tional aptamer-complementary system can 
hypothetically be linked serially to the first 
system comprising the arm, with or with- 
out certain spacing in between them to 
ensure that the two systems are truly inde- 
pendent. 

In the present demonstration, the native 
state of each arm (in the absence of 
its cognate input molecule) is closed. 
However, it should also be straightforward
to design arms that are open in the
native state, and close in the presence of 
the input. Such an arm could be based on 
a hairpin structure, from which binding 
of the input molecule releases a segment 
that attaches to the nanorobot chassis and 
links both halves of the clamshell. This 
design can be adjusted to sense the other 
types of input listed above. 
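
The combination-lock picture and the inverted arms described above can be sketched in a few lines. This is only an illustrative model: the `Arm` class, its `opens_on_input` flag and the input names are hypothetical and do not describe the actual DNA design.

```python
from dataclasses import dataclass
from itertools import product


def lock_states(p: int, n: int) -> int:
    """Number of states of a lock with n cams and p symbols per cam."""
    return p ** n


@dataclass(frozen=True)
class Arm:
    """One gate arm of the (idealized) nanorobot.

    opens_on_input=True  : closed natively, opens when its input binds.
    opens_on_input=False : open natively, closes when its input binds
                           (the hypothetical inverted arm).
    """
    name: str
    opens_on_input: bool = True

    def is_open(self, input_present: bool) -> bool:
        return input_present if self.opens_on_input else not input_present


def robot_open(arms, inputs) -> bool:
    """The clamshell opens only if every arm is open (n-input AND)."""
    return all(arm.is_open(inputs[arm.name]) for arm in arms)


def truth_table(arms):
    """List of (input values, robot open?) for every input combination."""
    names = [arm.name for arm in arms]
    rows = []
    for values in product([False, True], repeat=len(names)):
        rows.append((values, robot_open(arms, dict(zip(names, values)))))
    return rows


two_arm = [Arm("input_1"), Arm("input_2")]
mixed = [Arm("input_1"), Arm("input_2", opens_on_input=False)]

print("states of a 3-cam, 10-symbol lock:", lock_states(10, 3))
for label, arms in (("AND", two_arm), ("input_1 AND NOT input_2", mixed)):
    print(label)
    for values, is_open in truth_table(arms):
        print("  ", values, "->", "open" if is_open else "closed")
```

Adding further arms to the list extends the gate to more inputs without changing the evaluation rule, which mirrors the serial linkage of additional aptamer systems suggested in the text.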

Although the design discussed here is 
merely a prototype, and is not necessarily 
optimal, its mechanical framework makes 
it simple to design logic. The concepts dis- 
cussed here can also be used with other 
types of device. As an example, consider a 
DNA polyhedron of the category described 
by Mao and colleagues [55], in which seg- 
ments of the junction motifs that build 
the shape are replaced with aptamers or 
other DNA-based sensors. Such a device
might be a polyhedron that sequesters
a drug but disintegrates in the presence
of the input, for example a bacterial or
tumor enzyme that degrades the sensor
sequence. This simple device could respond
to a single type of input; however,
to raise its logic capacity would require a 
drastic change in its fabrication strategy, 
which should be considered. Mao’s poly- 
hedra are typically built using algorithmic 
self-assembly, in which the number of parts 
required to build a structure is larger than 
the number of types of different parts. The 
self-assembly of scaffolded DNA origami, 
for example, is not algorithmic as each 
building block is unique and has a specific 
position in the folded shape. The integra- 
tion of additional sensors would almost 
certainly require additional types of build- 
ing block, which in turn would mean that 
the polyhedron could not be constructed 
elegantly from a single junction motif. 
However, this could be accomplished with 
a careful design of the structure, taking 
into consideration the potential conse- 
quences for a correct assembly of the 
shape. 

In this regard it is important also to 
mention the DNA computers described 
by Benenson and colleagues [24, 25, 
91]. These constructs are not mechanical 
devices per se; rather, they can be thought 
of as pieces of code. Indeed, it might 
be imagined that such a construct could 
be used as the logical processor of a 
mechanical DNA nanorobot, similar to 
those discussed here. 


3.4 
Collective Behaviors 


Natural and artificial systems of many 
independent agents demonstrate the abil- 
ity to collectively perform many complex 
tasks, such as construction (Turner,
J.S., American Entomologist, 51, 36-38,
2005 [1]; Petersen, K. et al., Robotics: Science
& Systems VII, 2011), search [92], and
locomotion [93] (Murata, S. and Kurokawa,
H., IEEE Robotics & Automation Magazine,
March 2007), with greater speed, efficiency,
or effectiveness than could a single
agent alone. Direct and indirect coordination
methods allow agents to collaborate,
to share information, and to adapt their
activity to fit the situation.

While most such systems rely on capa- 
bilities well beyond those of nanorobotics, 
some systems could be realized and imple- 
mented in real DNA nanorobots for ther- 
apeutic applications. For example, DNA 
strand displacement cascades similar to 
those demonstrated by Qian and col- 
leagues [12, 31] can form the basis for a 
population of nanorobots interacting with 
each other, and operating in a defined 
sequence. An additional system might 
be of nanorobots building physical struc- 
tures in one, two, or three dimensions in 
response to an input molecule or condi- 
tion. 

DNA nanorobots exhibiting collective 
behaviors could be extremely valuable 
in therapeutic applications. For example, 
they could carry out an autonomous exci- 
sion of an abnormal tissue by collectively 
defining the target tissue, enzymatically 
cutting it from its environment, and then 
cooperating to carry it to a different lo- 
cation, possibly close to the skin, where 
it would be simpler to remove. Another 
example is that of nanorobots which can 
coordinate their activity to enable multiple
drugs to function in parallel, without
actually colliding with each other and gen- 
erating adverse effects. 

While this situation is still speculative, 
the next few years could witness the first 
prototypes of DNA nanorobots capable of 
exhibiting collective behaviors — a concept 
that has so far only been discussed in 
simulations and in-silico models.


3.5
Synthetic Genetic Circuits and DNA 
Nanorobotics 


During the past decade, several research 
groups have described new types of engi- 
neered construct that could program a cell 
to express genes in a predefined, complex 
fashion. For example, synthetic circuits 
have been designed to function as os- 
cillators [94], as toggle switches [95] and 
counters [96], and multicellular systems 
have also been described [97]. A compre- 
hensive discussion of these constructs and 
their engineering and properties is beyond 
the scope of this chapter, but details can 
be found elsewhere [98]. 

While such synthetic circuits can op- 
erate only in the context of cells, they 
can nevertheless be coupled to the syn- 
thesis of a secreted signal to which the 
nanorobots outside the cell can respond. 
This would enable nanorobots to coordi- 
nate their activity, to oscillate, and to per- 
form essentially as an extracellular avatar 
governed by the genetic circuit's product 
and following its dynamics. Clearly, this 
could open up exciting possibilities for 
controlling nanorobots in therapeutic set- 
tings. 


4 
Challenges of Applying DNA Nanorobots to 
Therapeutics 


DNA nanorobotics could improve
medicine and therapeutics with novel 
capabilities that are completely unachiev- 
able using current technologies. However, 
several technical challenges could hinder 
the use of DNA nanorobots as therapeutic 
devices in vivo. 

DNA is not an optimal material for 
constructing therapeutic devices for two
main reasons. First, it is rapidly degraded
by nucleases in the serum and tissues, 
most notably by DNase I. Although the 
activity of DNase I is highly variable in the 
population owing to several mechanisms 
[99], it is very efficient and DNA origami 
has been shown to survive for only very 
short periods of time under DNase I 
treatment [87]. Efficient extracellular DNA 
digestion serves an immune function, as 
DNA floating freely in the serum is
most likely to be associated with pathogens,
particularly viruses. Therefore, the use of 
DNA nanorobots as therapeutic devices 
requires, primarily, a good method of 
overcoming nuclease susceptibility. 

Second, as noted above, DNA can be in- 
terpreted as a pathogenic component and 
consequently can elicit a potent immune
response. The mechanism responsible for 
this recognition of DNA is Toll-like re- 
ceptor 9 (TLR9), an intracellular receptor 
that binds nonmethylated CpG DNA [100]. 
However, it is not only exogenous DNA 
that is immunogenic; if DNA that has been 
spilled from necrotic or damaged cells is 
not cleared efficiently, then autoimmune 
diseases such as systemic lupus erythe- 
matosus (SLE) might erupt [101, 102]. 
Clearly, the inherent immunogenicity of 
DNA is yet another obstacle that could 
hinder the use of DNA devices in thera- 
peutics. 

Nonetheless, some useful insights can
be drawn from orthogonal fields in
which the delivery of systemic DNA is 
also a primary goal. Such fields include 
gene therapy and DNA vaccines. In these 
contexts, methods devised to increase the 
survivability of DNA in the serum and 
tissues could (hypothetically) be adapted 
and used for DNA origami nanorobots. For 
example, DNA origami structures could be 
embedded in nanoparticles or liposomes, 
coated with polyethylene glycol (PEG), or
designed to exclude labile or immunogenic 
motifs, although it is still not clear whether 
the nanorobots could be modified and 
maintain their functionality. 

Finally, it is still not clear
how efficiently DNA origami can enter
cells. “Professional” phagocytes such
as macrophages and dendritic cells can
take up DNA origami efficiently [103],
thus highlighting a potential use for DNA 
origami structures as intracellular delivery 
vehicles. On the other hand, certain 
DNA structures such as the nanorobot 
described here enter cells extremely
slowly, which suggests that this is a 
cell-specific phenomenon. Nevertheless, 
DNA origami could be functionalized 
with cell entry-promoting molecules such 
as positively charged peptides or polymers 
(e.g., compound 48/80). Alternatively, 
DNA origami could be decorated with 
ligands to receptors undergoing massive 
endocytosis and recycling, such as 
receptors for immunoglobulin A and 
oxidized low-density lipoprotein (LDL). 

Despite these major challenges, it is 
highly likely that DNA nanotechnology 
will enter the arena of therapeutic devices 
within the next few years. Moreover, 
the nanorobots described in this chapter, 
along with other technologies, represent 
good candidate platforms [66, 103, 104]. 


5 
Summary and Conclusions 


During a 30-year period of research into 
DNA nanotechnology, insights regarding 
the structure, topology and mechanics of 
DNA have been translated into fascinat- 
ing nanoscale devices, whether comput- 
ers, motors, and/or architectures. How- 
ever, it is only just being realized how 
these components could be integrated
to create an entirely new generation of 
autonomous robots that would be capa- 
ble of reading from and writing to the 
biochemistry and physiology of a living 
organism, using drugs, molecular surgical
tools, endogenous hormones and
growth factors to improve the outcome
in any given case. Although several challenges
clearly need to be addressed before this 
development can take place, it is safe to 
assume that the next five years will wit- 
ness the first successful implementations 
of DNA nanorobotics in therapeutics and 
biomedical applications. Once successful, 
DNA nanorobotics could revolutionize the 
paradigms in a diversity of fields ranging 
from drug discovery and design to sur- 
gical procedures, regenerative medicine, 
and aging.


(miRNA), and small interfering RNA (siRNA) - are central to RNA interference. Both
siRNA and microRNA exert post-transcriptional/translational repression by base-pairing 
to their RNA targets with the aid of Argonaute proteins and their associated proteins. 
RNA interference is essential to virtually every aspect of cell physiology and plays an 
important role in defending cells against parasitic nucleotide sequences such as viruses 
and transposons. 


Logic circuit 
A logic circuit is a collection of connected modular parts that implements a potentially 
sophisticated logical operation on one or more logic inputs and produces a corresponding 
logic output. 


Genetic circuit 

A genetic circuit is an assembly of functionally connected genetic parts. Unlike
electronic circuits connected by electrical signals, genetic parts orchestrate the levels
of gene expression in living cells through molecular signals. 


Noncoding small RNAs regulate gene expression through complex RNA inter- 
ference (RNAi) signaling networks. The elucidation of molecular mechanisms 
underlying RNAi and the discovery of commonly occurring transcriptional and 
post-transcriptional motifs have enabled the construction of RNAi-based sensors 
and devices used for engineering genetic modules and logic circuits that offer 
sophisticated control of biological systems. In this chapter, recent progress in the 
design and implementation of RNAi-based logic circuits for the sensing and pro- 
cessing of multiple molecular signals to generate programmed biological actuation 
is discussed. 


1
Introduction


RNA interference (RNAi) is mediated by
small noncoding RNAs [1, 2], mainly
microRNA (miRNA) and small interfering
RNA (siRNA) in metazoans. The function
of RNAi is implicated in almost every
aspect of cell physiology through complex
biological networks. As such, RNAi
provides a versatile, modular and scalable
interface between synthetic circuits and
endogenous molecular inputs, and can
expand the abilities of synthetic circuits
to control biological systems. Inspired by
commonly found motifs in natural
biological networks and a general framework
used in many nonbiological engineered
devices, synthetic gene circuits are
organized into three general modules (Fig. 1):

• Sensing of relevant input conditions
inside and outside the cell.

• “Computing” or processing those inputs
to determine whether and which
action to take.

• Producing a biologically active output
to actuate a physiological effect on the
cell [3].

Fig. 1 Information processing circuit overview. Signaling
molecules interact with a sensory module, the signals are
transferred to a computational module that evaluates the
logic and triggers an actuation module. The output can
influence cell state itself or the cellular environment.


Using an analogy to electronic circuits, 
these biological functions can be imple- 
mented by a group of connected genetic 
switches and circuits. For example, in cer- 
tain situations the operation of genetic 
circuits can be approximated as perform- 
ing logic functions using assemblies of 
Boolean “logic” gates, such as “AND,”
“OR,” and “NOT” gates [3, 4]. 
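
The three-module organization (sensing, computing, actuation) and its Boolean approximation can be illustrated with a toy program. This is a sketch under stated assumptions: the thresholds, the input names `A` and `B`, and the example program `A AND (NOT B)` are hypothetical and are not taken from any circuit described in this chapter.

```python
def sense(concentrations, thresholds):
    """Sensory module: digitize each molecular input against a threshold."""
    return {name: concentrations[name] >= thresholds[name] for name in thresholds}


def AND(a, b):
    return a and b


def OR(a, b):
    return a or b


def NOT(a):
    return not a


def compute(signals):
    """Computational module: an arbitrary example program, A AND (NOT B)."""
    return AND(signals["A"], NOT(signals["B"]))


def actuate(decision):
    """Actuation module: map the logic result to a cellular action."""
    return "express output" if decision else "stay silent"


# Hypothetical thresholds in arbitrary concentration units.
thresholds = {"A": 10.0, "B": 5.0}

samples = (
    {"A": 20.0, "B": 1.0},   # A high, B low
    {"A": 20.0, "B": 8.0},   # A high, B high
    {"A": 2.0, "B": 1.0},    # A low, B low
)
for conc in samples:
    print(conc, "->", actuate(compute(sense(conc, thresholds))))
```

The thresholding step is where the Boolean picture is only an approximation, since real molecular inputs vary continuously.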

The logic circuit principle has been 
used widely in engineering electronic de- 
vices. Whilst engineering a genetic logic 
circuit in cells is still much more challenging
than engineering its electronic counterpart,
great strides have been made during re- 
cent years. Standardized biological parts 
that function at various regulatory lev- 
els have been defined and cataloged [5], 
such as promoters [6-8], terminators [9], 
RNA switches [10-19], ribosomal binding 
sites [20], DNA-binding and transcription 
regulatory motifs, and protein-protein 
interaction domains [21-25]. While various
synthetic circuits have been developed
to perform programmed dynamic
behaviors in cells (oscillators [26-31],
memory [32-35], spatial patterns [36, 37], 
cascades [38] and pulse generators [39], 
digital and analog computations [12, 40, 
41], and complex biosynthetic pathways 
[42, 43]), most of these circuits are de- 
signed to generate controllable output in 
response to exogenous inputs or just a 
few endogenous inputs [44-47]. Genetic 
circuits that actuate programmable logic 
functions upon sensing multiple endoge- 
nous molecular signals would enable the 
sophisticated manipulation of living cells 
to achieve desired goals. 

Recently, RNAi-based genetic circuits 
have attracted increasing attention in syn- 
thetic biology [48-52]. In this chapter, at- 
tention is focused on recent progress with 
synthetic genetic circuits that utilize RNAi 
for sensing endogenous molecular signals, 
processing and actuation in mammalian 
cells. First, basic RNAi signaling pathways 
are introduced, after which natural and 
synthetic feedback and feed-forward loops 
(FFLs) that couple both transcriptional
and post-transcriptional regulation are
discussed. Sensing mechanisms for vari- 
ous types of molecules and a design frame- 
work for RNAi-based logic circuits are then 
highlighted. The chapter concludes with 
some comments on the construction and 
chromosomal integration of RNAi-based 
genetic circuits, and a discussion of the 
opportunities and challenges for their po- 
tential application in biotechnology and 
biomedicine. 


2 
Overview of RNA Interference 


Both siRNA and miRNA can exert post-
transcriptional and translational repres- 
sion by base-pairing to their RNA targets 
with the aid of Argonaute and associated 
proteins [2, 53, 54]. RNA targets are cleaved 
and degraded when the mature siRNA 
or miRNA sequence is fully complemen- 
tary to the RNA target [55]. Alternatively, 
translation of mRNA is repressed or the 
mRNA is deadenylated when the mature 
siRNA or miRNA sequence imperfectly 
pairs to the RNA target [56]. miRNA can 
also function as a critical interface be- 
tween chromatin remodeling complexes 
and the genome for transcriptional gene 
silencing [57]. RNA editing is another layer 
of regulation that affects RNAi efficiency, 
specificity, and dynamics [58-60]. These 
two types of regulatory mechanisms have 
been summarized elsewhere [54, 61]. 
miRNAs are transcribed from both in- 
tergenic regions and intragenic regions 
in their host genomes [62]. miRNAs 
in intergenic regions are usually driven 
by their own promoters, while intra- 
genic miRNAs are highly enriched in 
introns and share the same promoter 
with host genes [63]. Primary microRNA 
transcripts (pri-miRNAs) are precursors to
mature miRNA [64]. Pri-miRNA is usually 
transcribed by RNA polymerase II (some- 
times also by RNA polymerase III), and 
then polyadenylated and capped in the 
nucleus [53, 65]. In the canonical pathway,
pri-miRNA is cleaved into a ~70 nt
RNA hairpin (pre-miRNA) by a complex
of double-stranded RNA (dsRNA)-binding
proteins such as Drosha and DGCR8 in
a reaction termed microprocessing (Fig. 2)
[66, 67]. Microprocessing often couples 
with RNA splicing when miRNA resides 
in the intron of a host gene [68]. In 
some cases, pre-miRNAs are produced 
from very short introns (mirtrons) by splic- 
ing and debranching, without the aid of 
a Drosha-DGCR8 complex [69]. Whilst
pre-miRNA is exported to the cytoplasm 
usually by exportin 5 [70], siRNA is de- 
rived from exogenously introduced long 
dsRNA [71]. Both dsRNA and pre-miRNA
are usually cleaved by Dicer with the aid
of transactivating response RNA-binding
protein (TRBP) in the cytoplasm, producing
a ~20 bp RNA duplex [72, 73]. Only
one strand (the guide strand) of an RNA 
duplex is loaded onto the Argonaute pro- 
tein, which is the core component in the 
RNA-induced silencing complex (RISC) 
[74, 75].

Fig. 2 [...] script is processed in the nucleus by Drosha
and DGCR8 to pre-miRNA, exported to
the cytoplasm by Exportin5, further processed
by Dicer to mature miRNA, and [...]
exogenous double-stranded RNA (dsRNA) or
short-interfering RNA (siRNA) can be directly
processed by Dicer and loaded on Ago2. Binding
of the miRNA/siRNA-Argonaute complex
to the target mRNA silences its expression.
Full miRNA-target complementarity induces
cleavage and degradation of the mRNA, while
partial complementarity results in translational
repression or deadenylation.

In mammalian cells, when the target
site is a perfect match to mature
miRNA, the miRNA will function similarly
to an endonuclease and this will
result in multiple turnovers of mRNA
cleavage [55]. In comparison, most
endogenous miRNAs imperfectly and
constantly bind their native target transcripts,
causing translational repression or RNA
degradation [56, 76] (Fig. 2). Interestingly,
the cleavage pathway hardly saturates
the function of innate miRNA [77],
while the overexpression of plasmid genes
with imperfectly matched targets exhibits
a sponge effect on innate miRNA in
mammalian cells [78]. These results imply
that when a fully complementary target site
is inserted into a transcript, usually in
the 3’-untranslated region (UTR),
miRNA can be rewired to cleave this
transcript without saturating the natural
miRNA regulation.
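
The rule of thumb stated in this section (full complementarity leads to cleavage, partial complementarity to translational repression or deadenylation) can be written as a small classifier. The sketch below is a deliberate simplification: the requirement that guide nucleotides 2-8 (the "seed") remain paired for repression is an added assumption that is not taken from this chapter, G:U wobble pairs are ignored, and the guide sequence and function names are illustrative only.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}
# Each substitution changes the base, so the mutated position no longer
# matches the fully complementary site.
MISMATCH = {"A": "C", "C": "A", "G": "U", "U": "G"}


def reverse_complement(rna: str) -> str:
    """Reverse complement of an RNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna.upper()))


def mutate(site: str, start: int, stop: int) -> str:
    """Replace site[start:stop] (0-based) with mismatching bases."""
    changed = "".join(MISMATCH[base] for base in site[start:stop])
    return site[:start] + changed + site[stop:]


def predicted_outcome(guide: str, site: str) -> str:
    """Classify a target site (5'->3') against a guide strand (5'->3')."""
    guide, site = guide.upper(), site.upper()
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    ideal = reverse_complement(guide)
    if site == ideal:
        return "cleavage"
    n = len(guide)
    seed = slice(n - 8, n - 1)  # site bases facing guide nucleotides 2-8
    if site[seed] == ideal[seed]:
        return "translational repression or deadenylation"
    return "no silencing predicted"


guide = "UGAGGUAGUAGGUUGUAUAGUU"           # illustrative 22-nt guide strand
perfect_site = reverse_complement(guide)
bulged_site = mutate(perfect_site, 9, 12)    # central mismatches
seed_mutant = mutate(perfect_site, 16, 19)   # mismatches inside the seed match

for name, site in (("perfect", perfect_site),
                   ("bulged", bulged_site),
                   ("seed mutant", seed_mutant)):
    print(f"{name:12s} {site} -> {predicted_outcome(guide, site)}")
```

A synthetic sensor of the kind discussed here would use the first case, a fully complementary site placed in the 3’-UTR of the output transcript.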


2.1 
MicroRNA-Mediated Regulatory Motifs 


Network motifs are small recurring gene 
circuit topologies that perform specific 
information processing and regulatory 
functions, and often serve as useful
building blocks for larger synthetic 
gene networks [79, 80]. For instance, 
computational studies have demonstrated 
that feedback and feed-forward motifs 
with coordinated transcriptional and 
post-transcriptional regulation play an 
important role in controlling gene ex- 
pression [81-89]. Small regulatory motifs 
utilizing coupled transcriptional and 
post-transcriptional regulations include 
two-node feedback loops and three-node 
FFLs. As the function of miRNA-mediated 
feedback and FFLs has been discussed 
previously [2, 90, 91], attention is now 
focused on a few examples discovered in 
natural pathways that serve as a guide for 
engineering synthetic RNAi logic circuits
with desired functions. 

In a common two-node feedback motif, 
miRNA B (miR-B) represses component A 
(usually a transcription factor), while com- 
ponent A either activates miR-B (resulting 
in a unilateral feedback loop) or represses 
miR-B, forming a reciprocal feedback loop 
(Fig. 3a). Mathematical models suggest 
that the unilateral feedback loop exhibits 
an oscillatory expression of A and miR-B 
for a wide range of parameters, while the 
reciprocal feedback loop can function as
a bistable switch [89]. Experimental ev- 
idence suggests that the expression of 
c-Myb and miR-15a is inversely correlated 
in cells undergoing erythroid differenti- 
ation due to a unilateral feedback loop 
(Fig. 3b) [92]. For the feedback loop com- 
prising Myc/E2F and a cluster of miRNAs 
called miR-17-92 (Fig. 3b), a model sug- 
gests that miR-17-92 plays a critical role in 
regulating E2F/Myc protein levels during 
the ON state with large-amplitude oscil- 
lations [93]. Hybrid transcriptional and 
post-transcriptional bistable switches are 
often used in cellular decision-making 
during development. For instance, the re- 
pression of let-7 by HBL-1 (which in turn 
is the target of let-7) is responsible for 
the developmental transition from L3 to 
L4 stage of Caenorhabditis elegans [94]. 
In another case, COG-1 represses DIE-1 
by promoting miR-273 expression, while 
DIE-1 inhibits COG-1 through activation 
of miRNA lsy-6 in C. elegans (Fig. 3b). This
double-negative feedback loop controls the
transition from a hybrid precursor to either
a left or right asymmetric taste receptor
neuron [95]. A more complex feedback
loop comprising HNF4α, miR-124, IL6R,
STAT3, and miR-24/miR-629 (Fig. 3b)
has been shown to regulate hepatocellular
oncogenesis [96]. During hepatocellular
transformation, the transient repression
of HNF4α is stabilized with the feedback
loop, resulting in hepatocyte transformation
and cancer initiation. STAT3 induces
both miR-24 and miR-629 that redundantly
maintain inhibition of HNF4α,
which increases the stability of the HNF4α
repression state.

Fig. 3 MicroRNA-mediated regulatory motifs. (a) Topology of unilateral and reciprocal feedback loops, and coherent and incoherent
feed-forward loops where component C is regulated by component A in a consistent or opposite manner respectively; (b-d) Examples
of regulatory motifs discovered in natural pathways; (b) Simple and complex feedback loops; (c) Natural and engineered (last panel)
feed-forward loops; (d) Regulatory network with multiple feed-forward loops where miRNA serves as a hub.

In a recurring three-node
feed-forward motif, component A not only
directly regulates component C but also
indirectly controls component C through
component B. The direct and indirect
regulation of component C is consistent
in a coherent FFL, but functionally antagonistic
in an incoherent feed-forward
(IFF) loop. For a hybrid transcriptional
and post-transcriptional motif, miRNA 
can serve as either component A or B 
(Fig. 3a). Both computational and experimental
analyses have suggested that the
miRNA-mediated FFL provides a buffer- 
ing mechanism and enhances the ro- 
bustness of gene regulation [81, 97, 98]. 
For instance, miRNA-mediated coherent 
loops are often used for the precise tran- 
sition and homeostasis of cellular states 
by repressing leaky transcripts [81, 99]. 
The FFL with PKCα, MAPK, miR-15a and
Cyclin E controls DNA synthesis in response
to growth and oncogenic stimuli
in mammalian cells [97], in which the
activation of Cyclin E is achieved by a
PKCα-mediated fast induction of MAPK
and a slow de-repression of miR-15a 
(Fig. 3c). This redundant induction con- 
fers resistance to signal perturbations and 
a time delay to trigger DNA synthesis 
by “AND”-like logic that requires both
“arms” to be turned on. As an example of 
an IFF loop, c-Myc activates E2F1 expres- 
sion and limits its level by induction of 
a cluster of miRNA including miR-17-5p 
and miR-20a, which exert tight control of 
a proliferation signal in mammalian cells 
[100]. This IFF loop may result in a pulse 
of E2F1 expression to promote cell cycle
progression [79, 80]. As discussed below, 
the miRNA-mediated incoherent loop is 
also adaptive to a wide range of DNA 
copy number fluctuations [51]. miRNAs 
often serve as a hub connecting multiple 
feedback and FFLs, maintaining the ro- 
bustness of gene circuits to environmental 
perturbations. For instance, in Drosophila 
miR-7 is linked to two interlocked FFLs 
(Fig. 3d). The absence of miR-7 causes 
regulatory instability in response to tem- 
perature fluctuations [101]. 
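
The bistable behavior attributed above to the reciprocal (double-negative) feedback loop can be reproduced with a minimal dynamical model. The sketch below assumes a generic, symmetric pair of mutually repressing species with Hill-type repression and dimensionless parameters; it is not the model of ref. [89], and the parameter values are chosen only to make the two stable states visible.

```python
def simulate_toggle(x0, y0, a=4.0, n=2, dt=0.01, steps=5000):
    """Euler integration of a symmetric mutual-repression loop.

    x: level of component A, y: level of miR-B (dimensionless).
    dx/dt = a / (1 + y**n) - x   (miR-B represses A)
    dy/dt = a / (1 + x**n) - y   (A represses miR-B)
    """
    x, y = x0, y0
    for _ in range(steps):
        dx = a / (1.0 + y ** n) - x
        dy = a / (1.0 + x ** n) - y
        x, y = x + dt * dx, y + dt * dy
    return x, y


# Two nearby starting points on either side of the symmetric state
# end up in opposite stable states.
for start in ((2.0, 1.0), (1.0, 2.0)):
    x, y = simulate_toggle(*start)
    print(f"start A={start[0]:.1f}, miR-B={start[1]:.1f} "
          f"-> final A={x:.2f}, miR-B={y:.2f}")
```

Whichever component starts slightly ahead suppresses the other and locks the system into its own high state, which is the switch-like decision behavior described for developmental transitions.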


Despite their biological importance, 
the function of many miRNA-mediated
motifs is still poorly understood. One
challenge encountered is that natural 
regulatory motifs are usually embedded in 
large regulatory networks, which makes 
the interpretation of motif function more 
difficult. The building of synthetic miRNA 
circuits can shed light on these questions, 
and also provide a foundation for a variety 
of applications. For instance, Bleris et al. 
constructed and tested transcriptional 
and post-transcriptional IFF circuits in 
mammalian cells (Fig. 4) [51]. In this 
case, a plasmid was constructed to include 
a bidirectional tetracycline-responsive
promoter that drives two fluorescent 
proteins in opposite directions. One 
of the fluorescent proteins (FP1) is 
repressed by transcriptional repressor 
LacI or synthetic miRNA FF3 that is
coexpressed with the other fluorescent 
protein (FP2). Upon transfection of the 
plasmid into rtTA-expressing mammalian 
cells and the addition of doxycycline, 
the bidirectional promoter is activated. 
The level of FP2 indicates the abundance 
of the plasmid in the cells, while the 
level of FP1 serves as the output of the 
IFF circuit. Consistent with a mathematical
prediction, both LacI-mediated
transcriptional and miRNA-mediated 
post-transcriptional incoherent loops 
exhibit adaptations to changes in DNA 
template numbers. Surprisingly, however, 
the post-transcriptional IFF loops demon- 
strate a superior adaptation capability to 
gene dosage variations, higher expression 
levels, and lower intrinsic fluctuations. 
This study provided an example of how 
synthetic circuits can be used to explore 
the design principles of natural regulatory 
motifs, and ultimately to guide the design 
of more complex yet reliable synthetic 
circuits for various applications.

Fig. 4 (a,b) Synthetic incoherent feed-forward motifs.
Transcriptional and post-transcriptional incoherent feed-forward
motifs enable adaptive gene expression in mammalian cells
[51]. “Re” transcriptionally represses FL2 in (b).
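
The adaptation of an incoherent feed-forward circuit to gene dosage can be seen in a very simple steady-state calculation. The sketch below is an assumed toy model, not the mathematical model of ref. [51]: both the output transcript and its repressing miRNA are produced in proportion to the plasmid copy number, and the miRNA adds a degradation term for the output. All parameter names and values are hypothetical.

```python
def unregulated_output(copies, alpha=1.0, delta=1.0):
    """Steady-state output without the repressor arm: grows with dosage."""
    return alpha * copies / delta


def iff_output(copies, alpha=1.0, delta=1.0, beta=1.0, delta_m=1.0, k=1.0):
    """Steady-state output of a post-transcriptional incoherent FFL.

    miRNA level  m = beta * copies / delta_m
    output level y = alpha * copies / (delta + k * m)
    """
    mirna = beta * copies / delta_m
    return alpha * copies / (delta + k * mirna)


for copies in (1, 10, 100):
    print(f"copies={copies:3d}  unregulated={unregulated_output(copies):7.2f}  "
          f"IFF={iff_output(copies):.3f}")
```

In this toy model a tenfold increase in copy number (10 to 100) raises the unregulated output tenfold but changes the IFF output by less than 10%, because production and miRNA-mediated removal scale together.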


2.2
RNAi Sensors 


RNAi sensors can either transduce a 
molecular input to an active RNAi me- 
diator, or inhibit an active RNAi mediator 
upon sensing a molecular input (Fig. 5a). 
The RNAi mediator in turn represses the 
output level when a tandem repeat (usually
3x to 6x) of a fully complementary
siRNA target sequence is fused to the
3’-UTR of the output gene. An appropriately
designed siRNA [102] can efficiently
knock down such a target gene at a saturating
siRNA level. The first type of RNAi
sensor inverts the input
(i.e., OUTPUT = NOT(INPUT)), while the
second type produces a true output in the
presence of input (i.e., OUTPUT = INPUT). Various
sensing mechanisms have been developed 
with RNAi that detect different molecu- 
lar inputs, including small molecules [11, 
50, 103], mRNAs [104], proteins [105],
transcription factors [49], and miRNAs [52, 
77, 106-108]. 

Transcription factor and miRNA sens- 
ing has been amplified through the 
coupled regulation of gene expression, 
providing for an enhanced sensitivity. Leis- 
ner et al. constructed synthetic promoters 
to regulate artificial miRNAs by sensing 
either transcriptional activators or repres- 
sors (Fig. 5f) [49]. The artificial miRNA, 
created by integrating siRNA stems into 
the miR-30 miRNA backbone, modulates 
target gene expression with RNAi. Conse- 
quently, the level of a transcription factor 
(INPUT) is transduced into the level of the 
target gene (OUTPUT). 

However, it is often difficult to engineer
a synthetic promoter for sensing any
target transcription factor; rather, it is
usually easier to engineer miRNA sensors. As
discussed above, endogenous miRNA input
can be easily rewired to cleave a target
transcript by fusing a fully complementary
target sequence in the 3’-UTR (Fig. 5g) [52,
77]. miRNA sensors with fully complementary
target sequences allow sensing of
endogenous miRNAs by cleaving the target
transcript without competing with endogenous
miRNA functions that rely on translational
repression or the mRNA deadenylation
pathway [77]. Instead of having the RNAi
target sequence placed directly in the output
gene, the target sequence can be fused
to a repressor of an output [52]. The repressor
and the output then form an
RNAi inverter that releases the repression
of an output upon sensing a miRNA input
(Fig. 5g).

Fig. 5 RNAi-based logic switches. (a) NOT (input produces microRNA that represses the output) and TRUE (active RNAi mediator is
inhibited upon sensing a molecular input) logic functions with RNAi; (b-d) Modulating RNAi levels with proteins or small molecules
through a protein- or small-molecule-responsive aptamer inserted in the loop region of shRNA or pri-miRNA basal segment; (e) Regulation
of RNAi with mRNA. As, active siRNA strand; S, complementary strand; Pr, protection strand. Binding of the protection strand to the
target mRNA releases As and allows for reconstitution of functional siRNA (S:As); (f) Modulating RNAi with transcription factors (TF) by
expression of engineered microRNA from an inducible promoter. FL, fluorescence; (g) MicroRNA sensors created by fusing microRNA
target sites in the 3’-UTR of an output gene or repressor that inhibits production of the output to generate false and true logic functions,
respectively.

Several other sensing mechanisms have 
been developed by engineering loop se- 
quences of miRNA precursors. The loop 
sequences of miRNA precursors are 
highly conserved across species, and this 
has been shown to be important for 
miRNA maturation [2]. Indeed, several 
RNA-binding proteins such as Lin-28, het- 
erogeneous nuclear ribonucleoprotein A1 
(hnRNPA1), and a splicing regulatory pro- 
tein KSRP have been reported to bind the 
loop of various miRNA precursors and 
to regulate miRNA function [109-111]. 
Accordingly, an RNAi-based protein sen- 
sor can be constructed by replacing the 
short hairpin RNA (shRNA) loop sequence 
with an aptamer that specifically binds a 
cognate protein (Fig. 5b). Saito et al. con- 
structed a protein-driven RNA sensor by 
inserting a kink-turn RNA motif (Kt) into 
the loop region of a synthetic shRNA [13]. 
An archaeal ribosomal protein L7Ae re- 
presses Dicer cleavage by binding the Kt 
motif in the loop region of the shRNA, 
turning on the output gene expression. 
This results in an L7Ae sensor where 
the output level correlates with the in- 
put level. By combining this sensor with 
an OFF translational switch that gener- 
ates output in the absence of L7Ae, Saito 


and colleagues demonstrated an efficient 
control of apoptosis in HeLa cells by simul- 
taneously repressing and activating two 
proteins with antagonistic functions [13].
Recently, Kashida and colleagues demon- 
strated a strategy aimed at improving 
the inhibitory effect on Dicer cleavage 
of aptamer-shRNA that optimizes the de- 
gree of steric hindrance between Dicer 
and the shRNA-protein complex based 
on a three-dimensional structure design 
[105]. Two RNAi sensors were created to 
sense either U1A spliceosomal protein or 
nuclear factor-κB (NF-κB) p50 protein in
293FT cells, suggesting that this platform 
can be applied for developing a variety of 
new protein sensors. 

Small-molecule sensors have been 
implemented using a similar strategy 
(Fig. 5c). Chung-il An and colleagues 
fused a theophylline aptamer to the 
loop region of shRNA that targets an 
enhanced green fluorescent protein 
(EGFP) reporter gene [103]. With this 
construct, EGFP levels are controlled by 
theophylline in a dose-dependent manner 
through modulating Dicer cleavage of 
the aptamer-shRNA. The theophylline 
aptamer-fused shRNA has also been used 
to modulate albumin gene expression 
[112]. An alternative RNAi small-molecule 
sensor was prepared by modular coupling 
of an aptamer, a competing strand, and an 
shRNA stem (Fig. 5c) [11]. Binding of the 
small-molecule input to the aptamer trig- 
gers strand displacement and base-pairing 
by the competing strand, which disrupts 
shRNA stem conformation and inhibits 
processing by the RNAi machinery. 
The input-output transfer function of 
these shRNA sensors was determined 
by combining experimental results with 
mathematical modeling, resulting in the 
identification of fine-tuning strategies that 
are related to sequence changes in the
competing strand or aptamer domains. By
using this approach, two shRNA sensors 
were integrated to create a two-input 
circuit that was responsive to theophylline 
and hypoxanthine [11]. 
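
An input-output transfer function of this kind can be sketched with a Hill-type expression. This is an illustrative assumption and not the fitted model of Ref. [11]: ligand binding is taken to reduce the fraction of shRNA that is processed, and the reporter rises from a silenced basal level towards its maximal level. The parameters `k_half`, `hill`, `basal` and `maximal` are hypothetical.

```python
def active_shrna_fraction(ligand, k_half=1.0, hill=1.0):
    """Fraction of shRNA still processed by the RNAi machinery (assumed Hill form)."""
    return 1.0 / (1.0 + (ligand / k_half) ** hill)


def reporter_level(ligand, basal=0.1, maximal=1.0, k_half=1.0, hill=1.0):
    """Reporter expression: basal when fully silenced, maximal when RNAi is blocked."""
    active = active_shrna_fraction(ligand, k_half, hill)
    return maximal - (maximal - basal) * active
```

In this sketch, changing `k_half` or `hill` stands in for the sequence changes in the competing strand or aptamer domains that shift and reshape the response curve.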

A sensory aptamer fused to the loop re- 
gion of shRNA may interfere with shRNA 
nuclear export and/or Drosha processing,
resulting in a reduced RNAi efficiency 
[11]. Furthermore, growing evidence sug- 
gests that shRNA-based control systems 
can trigger cellular immune responses 
and cause in vivo toxicity [113, 114], 
which may limit the broader application 
of shRNA-based gene expression control 
systems. It has been reported that Drosha 
processing requires single-stranded basal 
segments in pri-miRNA, and that the
bulge size in this region determines the 
Drosha processing efficiency [115]. Re- 
cently, Beisel et al. constructed a small- 
molecule-responsive miRNA by integrat- 
ing an aptamer into the pri-miRNA basal 
segment (Fig. 5d) [50]. With this configuration,
the interaction between aptamer and ligand
strengthens the local RNA secondary structure
in the miRNA basal segment, resulting in an
inhibition of Drosha processing and a reduction
of RNAi efficiency.

Sensors have also been developed for 
nucleic acid inputs and mRNA transcripts 
in cell-free systems [3, 116]. Because the 
signaling mediator in the RNAi signal- 
ing pathway is the small single-stranded 
miRNA molecule, it seems appropriate 
to search for a sensor design that trig- 
gers RNAi upon sensing a complemen- 
tary RNA strand. Recently, a prototype of 
such an mRNA sensor has been demon- 
strated in a Drosophila embryo lysate 
[104]. The mRNA sensor comprises three 
RNA strands: a protecting strand (Pr); an 
antisense strand (As); and a sense strand 


(S). The Pr and As strands are first an- 
nealed to form a partially dsRNA duplex 
(Fig. 5e). The Pr strand is complementary 
to the mRNA transcript, and base-pairing 
between Pr and the mRNA transcript 
releases the As strand via strand migra- 
tion and displacement. The As strand 
then hybridizes with the S strand to 
form an active siRNA molecule, which 
represses the cognate target gene through 
RNAi. 
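
The strand-displacement logic of this sensor can be sketched at the level of sequence complementarity. This is a deliberately simplified illustration: it assumes perfect, full-length complementarity, ignores overhangs, kinetics and thermodynamics, and uses hypothetical function names.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence (written 5' to 3')."""
    return rna.translate(COMPLEMENT)[::-1]


def as_strand_released(mrna: str, pr: str) -> bool:
    """Pr leaves the Pr:As duplex only if the mRNA carries its full complement."""
    return reverse_complement(pr) in mrna


def target_repressed(mrna: str, pr: str, as_strand: str, s_strand: str) -> bool:
    """Active siRNA (S:As) forms only after As is released and pairs with S."""
    sirna_forms = reverse_complement(as_strand) == s_strand
    return as_strand_released(mrna, pr) and sirna_forms
```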


2.3
Information Processing and Actuation for 
RNAi-Based Logic Circuits 


A molecular logic evaluator (the computational
core in a synthetic logic circuit)
can integrate multiple signals transmitted 
by sensors, and process these to trigger 
cellular responses according to predefined 
logic or behavior. Several synthetic molec- 
ular logic evaluators (summarized in a 
recent review [3]) have been demonstrated 
in mammalian cells using RNA devices 
[19, 103], transcription factors [23, 117], 
and nanorobots [118]. Here, attention 
is mainly focused on an RNAi-based 
logic evaluator in mammalian cells 
[48, 49]. 

A complex Boolean logic expression can 
usually be broken into a hierarchy of 
simple clauses in several equivalent, but 
distinct, ways. For example, any Boolean 
logic statement can be rewritten in either 
disjunctive normal form (DNF) or con- 
junctive normal form (CNF). DNF consists 
of a sequence of disjunctive operations 
(OR) on a set of conjunctive clauses (AND).
Each conjunctive clause considers one or
more literals (i.e., inputs and negations of
inputs) and evaluates to TRUE only when 
all literals in the clause are TRUE. Accord- 
ingly, a DNF logic statement is TRUE if 
any of its conjunctive clauses is TRUE. In
comparison, CNF consists of a sequence of
conjunctive operations (AND) on a set of
disjunctive clauses (OR), each considering
one or more literals. A disjunctive clause is
TRUE when one or more of the literals in the
clause evaluate(s) to TRUE. A CNF logic 
statement is TRUE only when all clauses 
are TRUE. 
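
The two normal forms can be made concrete with a short evaluator. In this sketch a literal is represented as a (name, negated) pair and a clause as a list of literals; this representation is an assumption chosen for illustration.

```python
def literal_value(literal, inputs):
    """Evaluate one literal: an input or its negation."""
    name, negated = literal
    return (not inputs[name]) if negated else inputs[name]


def eval_dnf(clauses, inputs):
    """DNF: OR over clauses, each clause an AND over its literals."""
    return any(all(literal_value(lit, inputs) for lit in clause) for clause in clauses)


def eval_cnf(clauses, inputs):
    """CNF: AND over clauses, each clause an OR over its literals."""
    return all(any(literal_value(lit, inputs) for lit in clause) for clause in clauses)


# The same statement in both forms:
# DNF: (A AND NOT B) OR C    CNF: (A OR C) AND (NOT B OR C)
DNF_EXAMPLE = [[("A", False), ("B", True)], [("C", False)]]
CNF_EXAMPLE = [[("A", False), ("C", False)], [("B", True), ("C", False)]]
```

The two example expressions are equivalent by the distributive law, so both evaluators return the same value for every combination of A, B and C.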

Multiple RNAi sensors have been inte- 
grated to evaluate both DNF and CNF logic 
expressions in mammalian cells [48]. In 
the DNF logic circuit implementation, two 
or more output genes encode the same out- 
put but carry different target sequences of 
siRNA mediators in their 3’-UTR regions 
(Fig. 6a). Each output gene represents one 
clause in the DNF expression, producing 
the output protein (the TRUE result) only 
when all siRNA mediators are absent. The 
entire DNF logic circuit produces a high 
level of the output protein when any output
gene (i.e., clause) is TRUE. In the
CNF logic circuit implementation, a sin- 
gle output gene is controlled by a repressor 
that is encoded by two or more repressor 
genes (Fig. 6b). Each repressor gene repre- 
sents a clause in the CNF logic expression 
and responds to multiple siRNA media- 
tors, such that the repressor is produced 
by a given clause only when none of the 
clause’s siRNA mediators are present. The 
CNF logic circuit generates a high level 
of the output protein only when none of 
the repressor genes/clauses produces the 
repressor. 
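
The two circuit layouts described above can be sketched as set operations on the siRNA mediators that are present. This is a Boolean idealization that assumes complete knockdown and no leakiness; the mediator names used in the usage note are hypothetical.

```python
def gene_expressed(target_sites, present_mediators):
    """A gene is expressed only if none of the mediators it is targeted by is present."""
    return target_sites.isdisjoint(present_mediators)


def dnf_circuit(output_genes, present_mediators):
    """DNF layout: output is high if any output gene (clause) escapes repression."""
    return any(gene_expressed(sites, present_mediators) for sites in output_genes)


def cnf_circuit(repressor_genes, present_mediators):
    """CNF layout: output is high only if no repressor gene (clause) is expressed."""
    return not any(gene_expressed(sites, present_mediators) for sites in repressor_genes)
```

For example, with clauses [{"si1", "si2"}, {"si3"}], the DNF layout gives a high output when only "si3" is present, whereas the CNF layout requires at least one mediator from each clause, such as "si1" together with "si3".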

This computational core serves as a 
modular information-processing layer of 
RNAi-based circuits, connecting
the sensory layer and the actuation layer.
For example, a DNF computational core 
was coupled with three miRNA-generating 
promoters responsive to their cognate 
transcription factors, resulting in output 
expression only when the transcription 
factor profile is matched (Fig. 6c) [49]. 


Recently, a CNF RNAi-based logic cir- 
cuit was demonstrated that resulted in the 
specific expression of DsRed fluorescent 
protein in HeLa cells but not in other 
cell types, based on a logic evaluation 
of the expression profile of six miRNAs 
(Fig. 6d) [52]. When the actuation pro- 
tein was replaced with a pro-apoptotic 
protein — the human Bax protein — it was 
observed that among a mixture of cultured 
cells the HeLa cells were selectively killed 
[52]. 

One important benefit to the digital logic 
abstraction is that it simplifies initial cir- 
cuit design. However, concentrations of 
molecular inputs vary along a continuous 
range, and therefore an understanding 
of the input-output transfer function of 
RNAi-based sensors, as well as circuit 
dynamics, is essential when predicting 
whether the circuit will generate the de- 
sired output for a given set of conditions 
(Fig. 7). Additional network motifs such as 
feedback and feedforward regulation can 
help to produce robust and reliable outputs 
when the circuit operates with fluctuating 
molecular inputs in noisy environments. 
For example, in a “HeLa-high” sensor,
input miRNA releases the repression of 
a repressor such that the output is high 
only when the input is high (Fig. 5g). In 
optimizing the performance of the HeLa 
classifier circuit, it was found that a motif 
consisting of activator rtTA and repressor
LacI would confer a reliable output
for various amounts of LacI upon sensing
endogenous miRNA inputs [52]. The
HeLa-high sensor was further optimized 
by adding a synthetic miRNA FF4 that 
is coexpressed with LacI, resulting in
a reduced leaky expression of the output
in the absence of the miRNA input
[52].

Fig. 6 RNAi-based circuitry in disjunctive normal form (DNF) and conjunctive normal form (CNF). (a) DNF.
Multiple genes encode the same output, but carry different microRNA target sites; the expression is true
(OUTPUT = ON) if any of the microRNA sets are not present; (b) CNF. Multiple microRNAs and repressors
regulate the same output gene; the expression is true only if all repressors are not present; (c) A DNF circuit
that produces output protein (ZsYellow) when either LacI-KRAB or (Rheo and rtTA) transcription factors are not
active; (d) A CNF classifier circuit that produces output (DsRed) only if a specific microRNA profile characteristic
of HeLa cells is matched.

Fig. 7 Example transfer functions of basic RNAi-based switches. Adapted from Ref. [52].


3 
Applications 


3.1 
Smart Therapeutics 


One of the great promises of synthetic bi- 
ology is the creation of “smart” molecular 
devices for therapeutic purposes. Genetic 
circuits offer sophisticated actuation con- 
trol that may never be achievable with tra- 
ditional pharmaceuticals. One important 
capability of genetic circuits is multi-input 
signal sensing and processing that enable 
the production of a therapeutic response 
only under specific environmental and cel- 
lular conditions. miRNAs offer a practical 
and useful class of inputs as they are not 
only differentially expressed in various cell 
types but also function as actuators by 
directly modulating protein expression. 

A pioneering study demonstrating the 
potential of coupling miRNA sensing to 
transgene expression was conducted by 


Brown et al. [77]. These investigations tested
the regulation of transgenes by endogenous
miRNAs in different configurations, and
showed that the suppression of a reporter
gene was proportional
to the number of miRNA target sites in 
its 3’-UTR (one, two, and four target sites 
tested). The results also confirmed that 
miRNA expression above a certain thresh- 
old value is required for an efficient target 
suppression. In further studies, endoge- 
nous miRNA was used to restrict trans- 
gene expression to a specific lineage, tissue 
or differentiation state of a cell. Wu and colleagues
[119] developed baculoviral vectors
that combined a tissue-specific promoter 
and miRNA regulation to restrict herpes 
simplex virus thymidine kinase (HSVtk) 
expression to glioblastoma cells. 

A similar approach of combining
transcriptional (promoter) and post-transcriptional
(miRNA target) regulation
was used by Leja et al. [120] to improve the
safety of an oncolytic adenovirus designed
to treat neuroendocrine tumors, including
carcinoids. Carcinoids tend to metastasize 
to the liver, and a liver-specific miRNA 
was used to suppress virus replication in 
normal hepatocytes. 

In another study, the miRNA target sites 
of two miRNAs that are typically downreg- 
ulated in prostate cancer cells were fused 
in the 3’-UTR of an essential viral gene of 
HSV-1 oncolytic virus, and this resulted 
in the restriction of viral replication and 
oncolysis to cancer cells [121]. 

Colin et al. [122] proposed the use 
of a pseudotyped lentivirus in combina- 
tion with post-transcriptional regulation 
of transgene by neuron-specific miR124 to 
restrict expression to astroglial cells of the 
central nervous system. 

Investigations conducted by Annoni and 
colleagues [123] addressed the issue of the 
unwanted autoimmune response that ac- 
companies gene therapy and is usually 
approached with nonspecific immunosuppression.
These authors induced an
antigen-specific tolerance by incorporating
target sites for a hematopoietic-specific miRNA
into an antigen-expressing transgene, and
successfully excluded expression from the
antigen-presenting cells (APCs).

Gentner and colleagues [124] suggested 
yet another approach where specific 
miRNAs were used to restrict transgene 
expression spatially and temporally in 
order to improve the safety of gene therapy 
based on engineered hematopoietic stem 
and progenitor cells (HSPCs). Globoid 
cell leukodystrophy is a fatal metabolic 
disorder caused by a mutation in the 
galactocerebrosidase (GALC) gene, and for 
which the only method of slowing disease 
progression is bone marrow transplanta- 
tion. Although the genetic manipulation 
of HSPCs in bone marrow may provide 
an improved therapy, GALC expression 
in nondifferentiated HSPCs has toxic 


effects. Nonetheless, the study identified 
two miRNAs (miR-126 and miR-130a) 
that were expressed only in HSPCs, but 
not later in their differentiation process. 
The incorporation of miR-126 target 
sequences into a GALC-expression vector 
resulted in a suppression of GALC in 
HSPCs, offering a safer alternative for
leukodystrophy gene therapy.

The use of differentiation state-specific 
miRNAs in the transplantation of stem 
cell-derived tissues may also prove to be a 
highly efficient method of eliminating un- 
differentiated cells, to avoid unrestrained 
growth in vivo [77]. This concept was ex- 
plored by Sachdeva et al. [125], who used a 
miRNA-regulated lentiviral vector express- 
ing a fluorescent reporter gene to track 
and purify (using fluorescence-activated 
cell sorting) embryonic stem (ES) cell- 
derived neuronal progenitors for trans- 
plantation and analysis. This approach led 
to a substantial decrease in the number 
of undifferentiated stem cells in the 
transplant, and resulted in a reduced 
tumor formation and increased survival 
of the transplanted cells. 

As discussed above, the potential of 
miRNA-based circuits was also demon- 
strated in a recent study where a six-input 
classifier circuit was designed to recognize 
a defined miRNA profile and generate an 
output response (fluorescent or apoptotic 
protein) only in a specific cancer cell line 
[52]. Clearly, given the increased knowl- 
edge and availability of miRNA expression 
data [126, 127], it may be soon possible to 
design similar classifiers tailored to other 
types of cancer and diseases. 


3.2 
Learning by Engineering 


Another important application of synthetic
circuits is in the elucidation of genetic
regulation and cellular function. In a recent
study, an adenovirus-based miRNA
sensor was used for the high-throughput
analysis of functional profiles of 115 miRNAs
in 12 different cell lines [127]. This
allowed for the identification of miRNA 
signatures of several of the cell lines, 
and is likely to be applicable also to pri- 
mary tissues. Coupling the analysis with 
multiple-input sensors might enable the 
characterization of cell subpopulations. 

In an earlier study, a miRNA sensor was 
incorporated into a lacZ-expressing vector
to investigate miRNA spatiotemporal ex- 
pression patterns in C. elegans [128]. As 
a consequence, the tissue-specific expres- 
sion of several miRNAs was observed, 
two of which exhibited expression pat- 
terns reminiscent of Hox genes. One of the 
miRNAs was also found to directly regu- 
late Hox8b, which suggested that miRNAs 
might play an important role in patterning. 
In another study, fluorescent reporter vec- 
tors coupled with differentiation-specific 
miRNA sensors were used to track and seg- 
regate progeny from ES cells and induced 
pluripotent stem cells, as they differen- 
tiated towards a neuronal lineage [125]. 
Mukherji et al. [129] employed a bidirec- 
tional reporter expressing two fluorescent 
proteins, one of which was targeted by en- 
dogenous miRNAs, to study the modes 
of miRNA repression itself. In this work,
single-cell measurements of miRNA repression
indicated that regulation by miRNAs
establishes a threshold level of mRNA
below which protein expression is highly 
suppressed. At high target mRNA levels, 
the miRNAs fine-tune protein expression 
whereas, at low mRNA levels, they serve 
as a molecular OFF switch. 
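
The threshold behavior can be illustrated with a simple equilibrium titration model. This is a sketch under stated assumptions and is not presented as the model of Ref. [129]: a fixed miRNA pool binds the target mRNA one-to-one with dissociation constant `kd`, only free mRNA is translated, and all quantities are in arbitrary concentration units. Solving f + M·f/(kd + f) = T for the free mRNA f gives the expression used below.

```python
import math


def free_mrna(total_mrna, mirna=10.0, kd=0.1):
    """Free (translatable) mRNA at equilibrium with a fixed miRNA pool.

    Positive root of f**2 + (kd + mirna - total_mrna) * f - kd * total_mrna = 0.
    """
    b = total_mrna - mirna - kd
    return 0.5 * (b + math.sqrt(b * b + 4.0 * kd * total_mrna))
```

With these assumed parameters, nearly all mRNA is sequestered when the total mRNA lies well below the miRNA pool, whereas far above it the free mRNA tracks the total with only a modest reduction.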

As described above, synthetic miRNA 
circuits also provide a tool for studying 
the function of miRNA-mediated motifs. 
Studies conducted by Bleris et al. have 


provided a step towards elucidating the 
various regulatory roles of miRNAs by 
engineering and testing of synthetic tran- 
scriptional and post-transcriptional IFF 
circuits in mammalian cells [51]. 


4 
Conclusions and Perspectives 


During recent years, a deeper understand- 
ing of RNAi-mediated mechanisms has 
inspired the development of RNAi-based 
sensors for various molecular signals. In
the meantime, these RNAi-based syn- 
thetic biology tool kits have advanced 
the engineering of modular and com- 
plex RNAi-based genetic circuits that exe- 
cute biological actuation by sensing mul- 
tiple molecular signals and computing 
pre-programmed logic functions.

Yet, further challenges must be ad- 
dressed before more elaborate and robust 
RNAi-based logic circuits can be effi- 
ciently designed and implemented for the 
controlled manipulation of mammalian 
cells. From a genetics viewpoint, it will 
be necessary to expand the repertoire 
of RNAi-based sensors with quantita- 
tive measurements for sensing additional 
types of molecular input. Further studies 
are also required to elucidate the de- 
sign principles of RNAi regulatory motifs 
and to apply these to synthetic circuits. 
Indeed, when designing circuits, more 
sophisticated computational models and 
simulations can not only help in the se- 
lection of appropriate sets of synthetic 
components but also reduce the costs of 
the trial-and-error methods that are preva- 
lent in current synthetic biology efforts. 
An additional problem is the difficulty
in delivering large synthetic circuits with
multiple transcriptional units into mammalian
cells, and in maintaining circuit
function for intended periods. None of
these challenges is unsolvable, however,
and ongoing efforts in synthetic biology
will continue to deliver new and exciting
solutions enabling the use of RNAi-based
circuits not only in biomedical applications
but also in biological research in general.

Keywords

Receptor: 


The biological recognition element used in the sensor. 


Target: 
The analyte of interest; the one which is to be quantified. 


Biorecognition: 
The process of identifying the target molecules through biocatalysis or bioaffinity 
reactions. 


Transducer: 
A device which converts the chemical signal from the biorecognition event into a 
quantifiable physical signal. 


Immobilization: 
The process of attaching receptor molecules to the transducer surface, without 
compromising its activity and selectivity. 


Biosensors are analytical tools in which biological or biologically derived 
receptor molecules are used as recognition elements in conjunction with 
physico-chemical transduction mechanisms. Biosensors can be classified according 
to the bio-recognition process or the transduction mechanism employed. Bioaffinity 
sensors involve affinity reactions between the receptor and the target, while 
biocatalytic sensors employ the specific catalysis of the target analyte by the biological
molecule. Depending on the transduction mechanism, biosensors can be divided
into four broad types: electrochemical; optical; piezoelectric; and thermal. Biosensors
find application in fields ranging from clinical and point-of-care diagnosis, medicine 
and drugs, process industries, environmental monitoring to defense and biowarfare. 
Present biosensor research is focused on developing more compact and easy-to-use 
devices while retaining the efficiency and sensitivity. 


1 
Introduction 


Biosensors are fascinating analytical 
tools that combine the specificity and 
sensitivity of biological processes with 
the physico-chemical transduction me- 
chanism to provide bioanalytical mea- 
surements. The International Union of 
Pure and Applied Chemistry (IUPAC) 
defines a biosensor as a device that uses 
specific biochemical reactions mediated 
by isolated enzymes, immunosystems, 
tissues, organelles or whole cells to 
detect chemical compounds, usually by 
electrical, thermal or optical signals [1, 
2]. In 1962, Leland C. Clark was the
first to elucidate the basic concept of 
the biosensor in his seminal report on 
“enzyme electrodes” [3], which he had 
built on his earlier invention of the 
oxygen electrode. Clark reasoned that 
the electrochemical detection of oxygen 
or hydrogen peroxide could be used for 
the analysis of a wide range of analytes 
that produce either oxygen or hydrogen 
peroxide on being acted upon by a specific 
enzyme(s) [4]. 

The field of biosensors can be divided 
into two broad categories of instrumen- 
tation: (i) sophisticated high-throughput 
laboratory instruments capable of deliver- 
ing rapid and accurate measurements; and 
(ii) easy-to-use, portable devices for use by 
non-specialists for decentralized, in
situ, or home analysis [4]. Medical diag- 


nostics — and in particular blood glucose 
sensors for diabetic patients — present the 
largest field of applications for biosen- 
sors, though they also find applications 
in food and process control, medicine, en- 
vironmental monitoring, and defense and 
security. A biosensor may serve different 
analytical functions; in a clinical diagno- 
sis it may just be required to determine 
whether a targeted analyte is above or be- 
low a certain threshold value, whereas in 
process control the sensor may be required 
to provide a continuous and precise feed- 
back about the analyte. Hence, the sensor 
needs to be designed to meet the requirements
of each application.
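
The difference between the two analytical functions can be sketched in a few lines. The example assumes a linear calibration (signal = slope × concentration + intercept), which is a simplification; real sensors often have a limited linear range. All names and parameters are hypothetical.

```python
def concentration_from_signal(signal, slope, intercept=0.0):
    """Continuous readout: invert an assumed linear calibration curve."""
    return (signal - intercept) / slope


def exceeds_threshold(signal, threshold, slope, intercept=0.0):
    """Threshold readout: report only whether the analyte is above a cut-off."""
    return concentration_from_signal(signal, slope, intercept) > threshold
```

A diagnostic test would use only the Boolean result, while a process-control loop would consume the continuous concentration estimate.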

Based on the type of biological 
recognition process involved, biosensors 
can be allocated to two categories: (i) 
biocatalytic, which are typically based 
on the selective catalysis of biochemical 
reactions by enzymes; and (ii) bioaffinity, 
in which affinity interactions resulting 
in the formation of biocomplexes (such as
antigen-antibody pairs, hybridized complementary
single-stranded DNAs, protein-nucleic acid
complexes, and chemoreceptor-ligand pairs)
provide a very
selective and sensitive mechanism for 
biosensing. Based on the transduction 
mechanism, biosensors can be broadly 
divided into four categories: electrochem- 
ical; optical; piezoelectric; and thermal. 
A combination of two transduction 
mechanisms can also be used to yield 
better sensitivity.

The aim of this chapter is to provide a
basic knowledge of the various types of 
biosensor, and to outline the underlying 
principles and general design criteria, by 
providing specific examples. 


2
Sensor Design 


Figure 1 shows a general schematic of
a biosensor, which usually consists
of three main components: the bioreceptor
or recognition unit; a transducer that
converts the chemical information from the
analyte-receptor interaction into an easily
measurable and quantifiable signal; and a
display unit that shows this signal. The
sensor design also incorporates an associ- 
ated electronic circuit or signal processor. 
In a biosensor, enzymes, cell organelles, 
tissues, microorganisms, antibodies and 
nucleic acids are the commonly used 
bioreceptors. Biologically derived materi- 
als such as aptamers and apoenzymes, or 
biomimetic materials such as molecularly 
imprinted polymers (MIPs) can also be 
used as bioreceptors. The biological re- 
action must usually take place in close 


Analyte solution 
b> Target analyte 


ia Other molecules > 
e e 
Bio-recognition layer: 

° Enzymes 

e Antibodies 


* Organelles 
° Tissues 


° Whole cells etc. 


Fig. 1 


Signal amplification 
and processing 


vicinity of the transducer, so that the 
transducer can pick up most of the chemi- 
cal information from the receptor—analyte 
interaction. 

In order to create a viable biosensor, the 
biorecognition unit must be properly at- 
tached to the transducer surface, without 
affecting the former’s activity. This pro- 
cess, which is known as immobilization, is 
the most critical step in the fabrication of 
any biosensing device. The choice of im- 
mobilization method depends on several 
factors, including the nature of the bio- 
logical component, the type of transducer, 
the physico-chemical properties, and the 
environment in which the sensor is in- 
tended to be used. The most commonly 
used immobilization methods used are 
adsorption, covalent binding, intermolec- 
ular crosslinking, matrix entrapment, and 
membrane entrapment. 


e Physical adsorption utilizes a combi- 
nation of van der Waals forces, hy- 
drophobic interactions, H bonding and 
columbic interactions to immobilize bi- 
ological elements on the surface of the 
transducer. Many substrates, such as 
cellulose, collodion, collagen, silica gel, 
glass, alumina and hydroxyapatite are 


Transducer: 


1.Electrochemical 2.Optical 
3.Piezoelectric 4.Thermal 


Schematic representation of sensor design. 


known to adsorb biomolecules. How- 
ever, the interaction forces between the 
substrate and the immobilized elements 
are weak, and the latter may tend to be 
released over a period of time, leading 
to sensor dysfunction [5, 6]. 

• Covalent binding involves the formation
of covalent bonds between
certain reactive groups of the biological
element which do not play a role in
the biorecognition process and the sub- 
strate surface, which is modified to have 
functional groups. Generally, the nucle- 
ophilic functional groups present in the 
amino acid side chain, such as amine, 
carboxylic acid, imidazole, thiol, and hy- 
droxyl are used for the coupling reaction. 
Coupling requires mild conditions such 
as low temperature, low ionic strength, 
and pH in the physiological range. Cova- 
lent binding leads to a uniform surface 
coverage and helps to eliminate certain 
problems such as instability, aggrega- 
tion, diffusion, and deactivation of the 
immobilized biocomponent [1]. 
• Bifunctional or multifunctional reagents
such as glutaraldehyde, hexamethylene
di-isocyanate, 1,5-difluoro-2,4-dinitrobenzene
and bisdiazobenzidine-2,2’-disulfonic acid are used for
immobilizing biocomponents through 
intermolecular crosslinking. The non- 
rigidity of the enzyme layer formed, 
the higher demands for amounts of 
biological material and the formation of 
multiple layers of enzyme, which nega- 
tively affects the activity, represent some 
of the disadvantages of this method. 
Moreover, larger diffusional barriers 
may delay interactions and increase the 
response time of the device [7]. 

• In matrix entrapment, the polymeric gel
matrix precursors are polymerized in
the presence of the biological elements
to be entrapped. The most commonly
used gels are polyacrylamide, polyvinyl
alcohol, polycarbonate, cellulose acetate, 
starch, alginate, and silica gel. Matrix 
entrapment is usually not the preferred 
method of immobilization as it may lead 
to possible delays in response time due 
to a diffusional barrier to the analyte and 
the leakage of biological species during 
sensor operation, resulting in a loss of 
bioactivity. 

• In membrane entrapment, enzyme solutions,
cell suspensions or tissue slices
can simply be encapsulated in analyte- 
permeable preformed membranes 
on the electrochemical transducer. 
Self-assembled monolayers (SAMs) and 
bilayer lipid membranes (BLMs) can 
also be used to encapsulate biological 
molecules and bind them to the trans- 
ducer surface. A sol-gel method is used 
to immobilize biological molecules in 
ceramics, glasses, and other inorganic 
materials. Bulk modification of the 
entire electrode, for example, enzyme- 
modified carbon paste or graphite epoxy 
resin [8], magnetic interactions [9], and 
biotin—avidin binding [10, 11] are also 
effective methods of immobilization. 


3 
Electrochemical Biosensors 


In the simplest of terms, an electrochemi- 
cal biosensor can be defined as one which 
transduces or converts a biological event 
into a measurable, reproducible, and dis- 
crete electronic signal. Electrochemical 
biosensors combine the electrochemical 
transducers’ sensitivity with the specificity 
of biological recognition processes involv- 
ing biological elements such as enzymes, 
proteins, nucleic acids, antibodies, cells, 
or tissues [12]. Electrochemical biosensors
are easy to fabricate and operate, are
portable, and are inexpensive to manufacture.
An electrochemical study of a
reaction will yield a measurable current 
(amperometry), a measurable charge ac- 
cumulation or potential (potentiometry), a 
change in the conductivity of the medium 
(conductometry), or changes in capaci- 
tance and/or resistance of the medium 
(impedance spectroscopy). As electrochemistry
is a surface phenomenon, electrochemical
transduction does not require
large sample volumes, and this provides 
for the effective miniaturization of biosen- 
sor devices. Electrochemical transduction 
helps in the speedy, continuous, real-time 
and inexpensive monitoring of many com- 
ponents in clinical laboratories and indus- 
tries [13]. 


3.1
Amperometric Biosensors 


The amperometric technique involves ap- 
plying a fixed potential to the working 
electrode versus a reference electrode, and 
measuring the current produced as a result 
of the electrochemical reduction or oxida- 
tion occurring at the working electrode. 
This current is proportional to the concen- 
tration of the electroactive product, which 
is in turn proportional to the nonelectroac- 
tive substrate in the sample. However, 
the electrolysis current is limited by mass 
transfer rates. Amperometric biosensors 
provide an additional selectivity because 
the oxidation or reduction potential used is 
normally characteristic of a particular an- 
alyte [12]. Additionally, the fixed potential 
used in amperometry results in a negli- 
gible charging current, which minimizes 
the background signal. Simplicity and a 
low limit of detection make amperometric 
transduction a good choice for biocatalytic 
and bioaffinity sensors [14]. 
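
The proportionality between the measured current and the analyte concentration is what makes a simple linear calibration possible. The following is a minimal sketch of such a calibration, assuming an ideal linear response; the function names and the calibration data are hypothetical and were invented purely for illustration, not taken from any cited study.

```python
def fit_calibration(concentrations, currents):
    """Ordinary least-squares line: current = slope * concentration + intercept."""
    n = len(concentrations)
    if n != len(currents) or n < 2:
        raise ValueError("need at least two paired calibration points")
    mean_c = sum(concentrations) / n
    mean_i = sum(currents) / n
    sxx = sum((c - mean_c) ** 2 for c in concentrations)
    sxy = sum((c - mean_c) * (i - mean_i)
              for c, i in zip(concentrations, currents))
    slope = sxy / sxx
    intercept = mean_i - slope * mean_c
    return slope, intercept


def concentration_from_current(current, slope, intercept):
    """Invert the calibration line to estimate an unknown concentration."""
    return (current - intercept) / slope


# Hypothetical standards (mM) and steady-state currents (uA), illustration only.
standards_mM = [0.0, 2.0, 4.0, 6.0, 8.0]
currents_uA = [0.10, 0.50, 0.90, 1.30, 1.70]

slope, intercept = fit_calibration(standards_mM, currents_uA)
unknown_mM = concentration_from_current(1.10, slope, intercept)
print(f"sensitivity = {slope:.3f} uA/mM, background = {intercept:.3f} uA")
print(f"estimated concentration = {unknown_mM:.2f} mM")
```

In practice the linear range is bounded at high concentrations by the mass-transfer limit mentioned above, so a calibration of this kind would only be valid within the range covered by the standards.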


The electrochemical cell usually consists
of a three-electrode system: (i) a working
electrode made from a conductive inert
material such as Pt, Au, or graphite, at
which the biochemical reaction involving
the target analyte occurs, catalyzed in most
cases by enzymes immobilized
on the electrode surface; (ii) a counterelectrode,
which is usually a Pt wire; and (iii) a
reference electrode against which the
potential measurement is made. However,
if the current density is low (<µA cm⁻²),
a two-electrode system without the refer- 
ence electrode can also be used; in fact, 
such a system is generally preferred in dis- 
posable sensors as long-term stability of 
the reference electrode is not required and 
the costs are lower [12]. 

The electrodes can easily be miniatur- 
ized to micron size, or even to nanometer 
size [15, 16], which results in low sam- 
ple volume requirements for detection of 
the analyte. Recently, screen-printed elec- 
trodes (SPEs) with patterned microelec- 
trodes have gained in popularity because 
of their low cost, ease, and speed of mass 
production [17]. Disposable SPEs have 
been used in immunochemical sensors 
and glucose sensors [18]. Interdigitated 
array electrodes consisting of two pairs 
of working electrodes made from parallel 
metal finger strips interdigitated and sep- 
arated by insulating materials may serve 
as another good amperometric transducer 
[12, 19].

Amperometric biosensors are typically 
based on enzyme electrodes. The simplest 
design of an amperometric biosensor is 
the direct detection of either the increase 
of an enzymatically produced electroac- 
tive species or the decrease of a substrate 
of a redox enzyme. A typical example 
of this design is a glucose sensor based 
on using glucose oxidase (GOx) as the 
biorecognition element. The increase in 
concentration of the product, H2O2, or the
decrease in concentration of the cosubstrate,
O2, is electrochemically monitored
in order to quantify the glucose concentration
[3, 20-23]. Such sensors are termed
“first-generation” biosensors (Fig. 2). Unfortunately,
the reproducibility of these
biosensors is dependent on the concentration
of oxygen, while the electrode potential
is prone to interference [24].

Fig. 2 “First-generation” amperometric biosensors. Reprinted with
permission from Ref. [24]; © 2011, John Wiley & Sons.
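
For orientation, the chemistry behind a first-generation glucose sensor can be written out as a reaction scheme. This is the commonly quoted textbook form, given here as a sketch rather than as the scheme of any particular cited device: the enzymatic step consumes O2 and produces H2O2, and the anodic step is the oxidation of H2O2 that generates the measured current.

```latex
\begin{align*}
\text{glucose} + \mathrm{O_2}
  &\xrightarrow{\ \text{GOx}\ } \text{gluconolactone} + \mathrm{H_2O_2} \\
\mathrm{H_2O_2}
  &\longrightarrow \mathrm{O_2} + 2\,\mathrm{H^+} + 2\,e^-
\end{align*}
```

Written this way, the dependence on dissolved oxygen is explicit: O2 is a cosubstrate of the enzymatic step, so fluctuations in its concentration affect the signal.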

The use of artificial redox mediators
was introduced to overcome the operational
problems associated with
first-generation biosensors. Cass et al.
developed the first amperometric glucose 
biosensor based on ferrocene, a redox 
mediator [25]; such biosensors are termed 
“second-generation” biosensors (Fig. 3).
Artificial redox mediators are small,
soluble molecules capable of undergoing
rapid and reversible redox reactions that
shuttle electrons between the active site
of the enzyme and the electrode surface.
These sensors are prone to leakage of
freely diffusing redox mediators from the
electrode surface, which adversely affects
their long-term operational stability.
However, this does not affect their
successful application in one-shot devices,
such as those for the self-monitoring of
glucose [13].

Fig. 3 “Second-generation” amperometric biosensors. Reprinted with
permission from Ref. [24]; © 2011, John Wiley & Sons.

A better biosensor architecture can be 
realized through the immobilization of a 
redox enzyme on the electrode surface, in 
such a way that a direct electron trans- 
fer is made possible between the active 
site of the enzyme (where the catalytic re- 
actions occur) and the transducer. Such 
a design obviates the use of freely dif- 
fusing redox mediators, and biosensors 
based on this design principle are termed 
“third-generation” biosensors (Fig. 4). A
reagentless biosensor architecture can be
realized by co-immobilizing the enzyme
and the mediator at the electrode surface
[24]. Third-generation biosensors have
a greater stability and can be used for
repeated measurements or continuous
monitoring, as neither the enzyme nor the
mediator needs to be added. Consequently,
such sensors are self-contained and the
cost of each measurement is reduced [12].

Fig. 4 “Third-generation” amperometric biosensors.


3.2 
Potentiometric Biosensors 


Potentiometric biosensors are based on 
the principle of measurement of charge 
accumulation at the working electrode 
compared to a reference electrode, un- 
der the conditions of negligible or zero 
current, and are governed by the Nernst 
equation: 


$$E = E^{\circ} + \frac{RT}{nF}\ln(a_i) \qquad (1)$$

where E is the measured potential of the
cell, E° is the standard cell potential at
temperature T, R is the universal gas
constant, n is the number of moles of
electrons transferred in the cell reaction, F
is Faraday’s constant (~96 500 C mol⁻¹),
and a_i is the chemical activity of the
species i.
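
A short numerical sketch may help to show what the Nernst equation implies for a potentiometric sensor. It assumes the natural-logarithm form E = E° + (RT/nF) ln(a_i) and uses standard values for R and F; the function names are illustrative only. The quantity of practical interest is the slope per tenfold change in activity, which is about 59 mV for n = 1 at 25 °C.

```python
import math

R = 8.314462618   # universal gas constant, J mol^-1 K^-1
F = 96485.33212   # Faraday constant, C mol^-1


def nernst_potential(e_standard, n, activity, temperature_k=298.15):
    """Cell potential (V) from E = E0 + (RT/nF) * ln(a_i)."""
    if activity <= 0:
        raise ValueError("activity must be positive")
    return e_standard + (R * temperature_k / (n * F)) * math.log(activity)


def slope_per_decade(n, temperature_k=298.15):
    """Change in potential (V) for a tenfold change in activity."""
    return (R * temperature_k / (n * F)) * math.log(10.0)


for n in (1, 2):
    print(f"n = {n}: {slope_per_decade(n) * 1000:.2f} mV per decade at 25 C")
```

The halved slope for a divalent ion (n = 2) is one reason why potentiometric sensors for ions such as Ca²⁺ resolve concentration changes less finely than those for monovalent ions.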


A typical potentiometric device set-up 
consists of a reference and one work- 
ing electrode in contact with the sample 
solution. Common electrodes used for
potentiometric quantification are glass
pH electrodes and ion-selective electrodes
(ISEs) for ions such as K⁺, Na⁺, and Ca²⁺
[26]. These sensors can be converted into
biosensors by using biological elements 
such as enzymes capable of catalyzing 
reactions that involve analyte molecules 
to produce ions for which the sensor is 
designed. An immobilized enzyme layer 
adjacent to the working electrode catalyzes 
a biological reaction involving the analyte 
in which ionic species are either con- 
sumed or produced. A local equilibrium 
is established at the sensor interface and 
the membrane potential, developed due 
to the difference in concentration of the 
ions across the membrane, is measured. 
The ISE generates an electrical signal in 
response to the change in concentration of 
ionic species. Currently, three types of ISE 
are used in potentiometric biosensors: 


• Glass electrodes for cations: these
use a very thin hydrated
glass membrane as the sensing element.
A transverse electrical potential
is developed due to the concentration-dependent
competition between
cations for specific binding sites. The
selectivity of a glass electrode is determined
by the composition of the glass.
A common glass electrode is shown in
Fig. 5.

• Gas electrodes, which are the usual glass
pH electrodes coated with hydrophobic
gas-permeable polypropylene or Teflon
membranes selective for gases such as
CO2, NH3, and H2S. The diffusion of
gases through the membrane causes
a change in the pH of the sensing
solution between the membrane and the
electrode, which is then determined.

• Solid-state electrodes, which consist of
a thin membrane of a specific ion
conductor made from a mixture of Ag2S
and AgX, where X is a halide anion.

Fig. 5 Typical arrangement of a glass electrode.


Although amperometric transduction,
given its good sensitivity and low limit of
detection, is favored in the case of glucose
biosensors, the details of a number
of potentiometric biosensors have been re- 
ported [27, 28]. Potentiometric biosensors 
are known for their simplicity of opera- 
tion, and their continuous measurement 
capability makes them interesting for en- 
vironmental applications, especially for 
monitoring heavy metals and pesticides 
[13]. With limits of detection as low as 10⁻⁸
to 10⁻¹¹ M, potentiometric biosensors are
suitable for measuring low concentrations 
in small sample volumes as they do not 
chemically influence the sample [29]. 

The ion-selective field effect transistor 
(ISFET), a type of potentiometric device, 
was developed by Bergveld and successfully
combines solid-state integrated
circuit (IC) technology with ISEs. The 
chemical-sensitive property of the glass 
membrane electrode is used in conjunc- 
tion with the impedance-converting char- 
acteristics of the metal oxide semiconduc- 
tor field effect transistor (MOSFET), in 
which the metal electrode (gate) is re- 
moved and its function is taken over by 
the sample solution under study (Fig. 6). 
This modified FET is capable of detecting 
changes in ion concentration when the 
gate is exposed to a solution containing 
ions. Many biosensors based on ISFET 
have been described since the first report 
of the enzymatic-modified ISFET (enzyme 
field-effect transistor; EnFET) for the de- 
termination of penicillin [30]. ISFET-based 
enzyme biosensors can also be used to de- 
tect and quantify heavy metal ions and 
organic pollutants, through the inhibitory 
action of such species on enzyme activity 
[31]. 

These biosensors have many advantages 
over other types of biosensor, notably 
miniaturization, high sensitivity, low cost 
and multianalyte detection potential [32]. 
Unfortunately, however, they still suffer 
from a variety of fundamental and tech- 
nological problems, such as the impurity 
of the semiconductor layer and instability 
of the functional groups in the sensing 
layer [33]. On the other hand, ISFETs can 
be directly incorporated into the electronic 
signal processing circuitry [13], and this 
can lead to their integration in microsys- 
tems such as micro-total-analysis-systems 
(µ-TAS) and lab-on-a-chip (LoC) [33].


Fig. 6 Typical arrangement of an ISFET device.


3.3
Conductometric Biosensors


Conductometric techniques rely on measurements
of the change in electrical
conductivity of the sample solution due
to the production of charged species, 
such as ions and electrons, during the 
course of a biochemical reaction cat- 
alyzed by an enzyme. For example, ure- 
ase — which catalyzes the production of 
ionic species—can be used in combina- 
tion with conductometric transduction. 
Conductance measurements have a lower 
sensitivity compared to other techniques 
[34], as conductance is sensitive to tem- 
perature, faradaic processes, double-layer 
charging, and concentration polarizations 
[13]. These effects can be minimized, 
however, by using an alternating current 
voltage for measurements [34]. This will
result in a lower limit of detection and
reduce potential interference from variations
in the ionic strength of the samples
[13]. Conductometric techniques can be 
used to create inexpensive and disposable 
sensors; however, in order to obtain re- 
liable measurements the ionic strength 
of the sample solution should undergo a 
significant change. 

Conductometric biosensors have been 
used for environmental monitoring, as 
they provide an easy-to-use, accurate,
selective, fast and cheap alternative to
conventional methods of heavy metal 
determination, such as gas and liquid 
chromatography, spectrophotometry, and 
chemical and physical techniques which 
are time-consuming and require expen- 
sive instruments and skilled personnel. 
The presence of heavy metals can be 
determined using thin-film interdigitated 
planar conductometric electrodes, with 
enzymes such as GOx, butyric oxidase 
and urease having been used to detect 
Ag⁺, Hg²⁺ and Pb²⁺ [13]. Conductometric
biosensors have also been used to moni- 
tor the presence of organic pollutants and 
pesticides in the environment [35, 36]. 


4 
Optical Biosensors 


4.1 
Conjugated Polymer-Based Biosensors 


Conjugated polymers (CPs) are
π-conjugated polymeric compounds in
which the backbone is composed of
alternating saturated and unsaturated
bonds, while the backbone atoms are
sp²-hybridized [37-41]. These sp² hybrid
orbitals, which are bonded through σ
bonds with the remaining out-of-plane p_z
orbitals overlapping with the neighboring
p_z orbitals, provide the movement of
free electrons. Therefore, the p-orbital
overlap is the origin of the emissive and 
conductive properties of CPs and provides 
unique optoelectronic properties under 
certain conditions. For example, CPs 
are highly conductive under chemically 
doped conditions and are good candidates 
for flexible electronic materials. The 
unique optoelectronic property of CPs 
has attracted much attention for use as 
effective optical transducers, with CPs 
emerging as the active materials for vari- 
ous applications including light-emitting 
diodes (LEDs), field effect transistors 
(FETs), light-emitting electrochemical 
cells (LECs), polymer actuators, plastic 
lasers, batteries, photovoltaic cells, and 
biomaterials for sensory applications. 
Conjugated polyelectrolytes (CPEs) are 
π-CPs that have charged (anionic or
cationic) side chains [42-46]. In this case, 
ionic groups such as sulfonate, carboxy- 
late, phosphate and quaternary ammo- 
nium ion are introduced into the chemical 
structures of the CPs to change their polar- 
ity. These ionic functional groups usually 
prevent the CPs from aggregating in water,
and also control their solubility. In particular,
CPEs may be good candidates for biological
applications because the excellent
water solubility of the CPs is essential for
their homogeneous use in aqueous media.
The ability to control the water-solubility
of CPs might be considered as a sensing
mechanism for CPEs to be exploited
as biosensory materials [47]. A change in
water solubility by adding the target ana- 
lytes, and the subsequent conformational 
change of the CPs, lead to alterations in 
the fluorescence wavelength and intensity 
of CPs. Recently, hydrophobic CPs have 
also been prepared as nanoparticles in an 
aqueous environment and used as sensory 
materials [48, 49]. 

CPs are largely classified based on how 
they release the energy absorbed from 
excitation. A CP molecule in an excited 
state can lose energy either by the emission of
radiation (as fluorescence and phosphorescence)
or by a radiationless transition, such
as the intersystem crossing shown in the
Jablonski diagram (Fig. 7). The emission
of radiation from the lowest vibrational
level of the excited state S1 to any of the
vibrational levels of the ground state S0 is
termed fluorescence (fluorescence lifetime:
10⁻⁹ to 10⁻⁷ s). Although the population
of triplet states by direct absorption
from the ground state is insignificant, 
a more efficient process exists for the 
population of triplet states from the 
lowest excited state in CPs (intersystem 
crossing). If intersystem crossing has 
occurred, and the initial spin state is 
different from the final energy levels, then 
the emission energy may change; this is 
termed phosphorescence. Once intersystem 
crossing has occurred, the molecule 
undergoes a singlet—triplet process within 
the lifetime of an excited singlet state 
(10⁻⁸ s), and the lifetime of a triplet state
is much longer than that of an excited singlet state
(ca. 10⁻⁴ to 10 s). This could be a good
indication to judge the type of emission 
in CPs. However, fluorescence in CPs 
is usually statistically much more likely 
than phosphorescence, unless vibrational 
coupling between the excited singlet state 
and a triplet state causes intersystem 
crossing (this usually could occur at a very 
low temperature, <80K) [39]. A similar 
type of emission termed delayed fluores- 
cence has been identified which normally 
follows the fluorescence characteristic 
emission spectrum; however, the lifetime 
of delayed fluorescence in CPs is slightly 
shorter than that of phosphorescence 
as it is caused by a recombination of 
geminate electron hole pairs rather than 
triplet—triplet annihilation. 

Fluorescence from CPs is very sensi- 
tive to any environmental changes around 
CPs. The optical properties of CPs undergo 
dramatic changes such as fluorescence 
amplification, quenching, or nonradiative 
energy transfer when the light is absorbed 
[50], and therefore the provision of mecha- 
nisms for optical changes in CPs allows 
their implementation in sensing appli- 
cations. This appealing property of CPs 
provides a highly sensitive transduction 
mechanism by signal amplification of CP 


fluorescence, and also explains various 
detection modes. The signal-amplifying 
model of CPs was proposed by Swager 
and colleagues in 1995 [51, 52]. When 
a target analyte binds locally to a re- 
ceptor on a CP repeat unit, the entire 
conjugated backbone is affected due to its 
one-dimensional wire-like optoelectronic 
property, such that the fluorescence of 
the entire polymer chain is altered. The 
wiring of chemosensory molecules in se- 
ries provides a universal method by which 
to obtain signal amplification relative to 
single-molecule systems. CPs are “‘molec- 
ular wires,” as the key feature of a CP 
is that it can harness extended electronic 
communication and transport. However,
the terms amplification and sensitivity enhancement
apply only when a single
event, the binding of an analyte, in a
supramolecular polyreceptor system produces
a response larger than that afforded
by a similar interaction in an analogous
small monoreceptor system.

The photophysical properties of CPs are 
strongly related to their polymer struc- 
ture, whether in solution and/or in solid 
state [50]. Changes in the chemical nature, 
effective conjugation length, intramolecu- 
lar conformation and intermolecular pack- 
ing will each have an influence on the 
color and intensity of fluorescence. The 
emission wavelength can be modulated 
through the design of backbone structure 
of CPs and by changing the charge density 
around the CP backbone. Side-chain modification
along the backbone, using either
electron-rich or electron-deficient functional
groups, tunes the emission wavelength.
The effective conjugation length
is also a critical factor in determining 
wavelength, with long chains generally 
showing a longer wavelength emission. 
However, the fluorescence wavelength of 
CPs is not influenced further when the
conjugation length of CP exceeds the
“effective” conjugation length. Rather, the 
shorter lifetime of CPs and the exciton 
mobility — which may be limited by con- 
formational disorder in solution — will pre- 
vent diffusion throughout the entire length 
of high-molecular-weight CPs. 

Several detection modes in CPs have 
been actively developed for the sensing 
of chemical or biomolecules, including 
fluorescence turn-on (amplification) and 
turn-off (quenching), fluorescence color 
change, and visible color change (Fig. 8). 
In the turn-on mechanism, the fluorescence
signal of CPs is inherently strong
but is initially completely or partially quenched
due to the change in electron density along 
the CP backbone as a result of conforma- 
tion changes or intramolecular packing. 
Target binding to CPs, and the associated 
conformational rearrangement of CPs or 
unpacking among the backbones, perturb 
the electronic state along the CP backbone 
and induce an enhancement in fluores- 
cence. Although the polymer may be sol- 
uble in water, the fluorescence quantum 
efficiency of CPs in aqueous solution may 
be low due to their limited water-solubility 
and the resultant polymer aggregation. 
The use of a surfactant to improve CP
solubility in water can also provide an
improvement in signal turn-on, without
the surfactant affecting target binding or
signal transduction.

Another interesting turn-on sensor has 
been developed as a colorimetric CP sen- 
sor by using polythiophene derivatives. 
These CPs provide a color change and 
signal enhancement when a target is 
bound to the receptor such that their 
conformation is altered. As an example, 
cationic polythiophene derivatives will 
form a duplex with negatively charged 
single-strand DNA molecules, which re- 
sults in polymer aggregation and, hence, 
fluorescence quenching. When a target 
single-strand DNA molecule is hybridized 
to the complementary receptor DNA, the 
DNA/DNA/polymer triplex will be less 
planar than when in the duplex conforma- 
tion, and so will have a shorter conjugation 
length and different absorption character- 
istics. Target molecules detected using this 
system vary from small DNA molecules to 
large protein molecules. 

Fig. 8 Conjugated polymer based biosensors: detection
mechanisms and modes. Reproduced with permission
from Ref. [50]; © 2010, Royal Society of Chemistry.
(http://pubs.rsc.org/en/Content/ArticleLanding/2010/AN/c0an00239a)

In a turn-off system, a variety of mechanisms
can result in quenching, such as
Förster or Dexter energy transfer, static
quenching, complex-formation between
polymers and a target, and collisional
quenching [53, 54]. For these reasons,
the fluorescence quenching of CPs is of- 
ten dependent on environmental factors 
such as temperature and pressure. In most 
CP-based quenching systems, target binding
alters the electronic state of CPs through
an intermolecular aggregation of the polymer
chains. Such aggregation is often due
to hydrophobic effects induced by target 
analytes, and the addition of surfactants 
or a change in temperature can prevent 
CP aggregation. Quenching that occurs 
upon interaction with a specific molecular 
biological target forms the basis of ac- 
tive optical contrast agents for molecular 
imaging. 

In a similar way, the fluorescence res- 
onance energy transfer (FRET)-induced 
detection mode also begins with the fluo- 
rescence quenching of CPs that normally 
are used as energy donors [55-57]. FRET
(also known as Förster energy transfer)
is a dynamic quenching mechanism because
energy transfer occurs while the
CP donor is in an excited state. When
a target analyte labeled with fluorescence
acceptor molecules is bound to a target
receptor and is located in the proximity
of CPs (usually within 10 nm), the CP as
the donor chromophore will transfer energy
to an acceptor chromophore through
dipole-dipole coupling. The efficiency of
FRET depends on many physical parameters,
including the distance between the
donor and the acceptor, the spectral overlap
of the donor emission spectrum and
the acceptor absorption spectrum, and
the relative orientation of the donor emission
dipole moment and acceptor absorption
dipole moment. The dominant factor
among these in CP-based sensors is the distance
between the donor and the acceptor,
because the rate of this energy
transfer is inversely proportional to the
sixth power of the distance between the
CP and the acceptor dye. Consequently,
FRET-based CPs have been used as potent 
tools to measure distances and to detect 
molecular interactions in a number of sys- 
tems, and are widely applied in biology 
and chemistry. 
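
The steep distance dependence described above can be made concrete with the standard Förster expression for transfer efficiency, E = 1 / (1 + (r/R0)^6), where R0 is the Förster radius at which half of the donor excitations are transferred. The sketch below assumes this textbook form; the 5 nm Förster radius is a hypothetical value chosen only for illustration and does not describe any specific CP/dye pair.

```python
def fret_efficiency(distance_nm, forster_radius_nm):
    """Forster transfer efficiency E = 1 / (1 + (r / R0)**6)."""
    if distance_nm < 0 or forster_radius_nm <= 0:
        raise ValueError("distance must be >= 0 and Forster radius > 0")
    return 1.0 / (1.0 + (distance_nm / forster_radius_nm) ** 6)


R0_NM = 5.0  # hypothetical Forster radius, for illustration only

for r_nm in (2.5, 5.0, 7.5, 10.0):
    print(f"r = {r_nm:4.1f} nm -> E = {fret_efficiency(r_nm, R0_NM):.3f}")
```

With these assumed numbers, doubling the separation from R0 to 2·R0 reduces the efficiency from 0.5 to 1/65, which illustrates why transfer becomes negligible beyond roughly 10 nm.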


4.2 
Surface Plasmon Resonance-Based 
Biosensors 


Surface plasmon resonance (SPR) was first 
observed by Wood in 1902, but a complete 
explanation of the phenomenon was not 
provided until 1968, by Otto [58-63]. Since 
the first application of SPR-based sensors 
to biomolecular interaction monitoring by 
Liedberg et al. in 1983, the phenomenon of 
SPR has served as a fascinating detection 
tool for biosensor applications [64]. When 
polarized light is shone through a prism
onto a sensor chip, on top of which is
a thin film of metal (usually gold or
silver), the film will act as a mirror and
reflect the light. On changing the angle 
of incidence, the intensity of the reflected 
light will pass through a minimum. At 
the angle of incidence when this occurs, 
the light will excite the surface plasmon 
so as to induce SPR and cause a dip in 
the intensity of the reflected light. The 
angle at which the maximum loss of 
reflected light intensity occurs is termed 
the resonance angle or SPR angle. The SPR 
angle depends on the optical refractive 
indices of the media at both sides of 
the metal. The SPR conditions can be 
changed, and the shift of the SPR angle 
is suited to provide information on the 
kinetics of target adsorption on the metal 
surface. A schematic of a biosensor device 
based on SPR is shown in Fig. 9. 
Fig. 9 Schematic of surface plasmon resonance (SPR)-based
biosensor. Reproduced with permission from Ref. [65]; ©
2002, Macmillan Publishers Ltd.


4.2.1 SPR Principles 

SPR can be used to monitor changes in 
the refractive index in the near vicinity 
of the metal surface. When the refrac- 
tive index changes, the angle at which 
the intensity minimum is observed will 
also change [66]. Hence, SPR not only 
provides an excellent means of measur- 
ing the difference between two states, but 
can also be used to monitor the change
in intensity over time. SPR sensors
function in only a very limited vicinity or
fixed volume at the metal surface, as the
exponential decay of the evanescent field
intensity in a typical SPR-based sensor
presents practical blindness at distances
beyond 600 nm from its surface. A process
occurring within the first few tens
of nanometers from the metal surface
will result in a few-fold higher response
than the same process performed at a distance
of a few hundred nanometers. The
electromagnetic field that penetrates beyond
the metal surface is termed the
evanescent field, and its penetration depth
does not exceed a
few hundred nanometers. The penetration
depth of the evanescent field is a
function of the wavelength of the incident
light. In order to provide selectivity for the
SPR sensor, its surface must be modified 
with ligands that are suitable for captur- 
ing the target compounds and which are 
permanently immobilized on the sensor 
surface. 
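
The distance dependence of the SPR response can be illustrated with a simple exponential decay model, I(z) = I(0)·exp(−z/d), where d is the decay length of the evanescent field intensity. This is a minimal sketch: the 200 nm decay length is an assumed, illustrative value (the actual figure depends on the wavelength and the materials used), and the function name is hypothetical.

```python
import math


def relative_field_intensity(distance_nm, decay_length_nm):
    """Evanescent field intensity relative to its value at the metal surface."""
    if distance_nm < 0 or decay_length_nm <= 0:
        raise ValueError("distance must be >= 0 and decay length > 0")
    return math.exp(-distance_nm / decay_length_nm)


DECAY_LENGTH_NM = 200.0  # assumed value, for illustration only

near = relative_field_intensity(20.0, DECAY_LENGTH_NM)    # a few tens of nm
far = relative_field_intensity(300.0, DECAY_LENGTH_NM)    # a few hundred nm
ratio = near / far

print(f"relative intensity at 20 nm:  {near:.3f}")
print(f"relative intensity at 300 nm: {far:.3f}")
print(f"near/far response ratio:      {ratio:.1f}")
```

Under these assumptions a binding event 20 nm from the surface is probed by a field about four times stronger than one at 300 nm, in line with the few-fold difference described in the text.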

SPR-based sensor applications are as- 
sociated with specific properties: (i) field 
enhancement; (ii) surface plasmon (SP) 
coherence length; and (iii) the phase jump 
of the reflected light upon SP excita- 
tion. In field enhancement, calculation of 
the electric field transmission coefficient
based on Fresnel’s equation for the inter- 
face shows that the electric field at the high 
refractive index side of the metal can be 
much smaller than that at the low index 
side of the metal layer. Very close to the
SPR angle, the intensity can be enhanced
by a factor of more than 30, a circumstance 
which explains much of the remarkable 
sensitivity that the SPR condition has for 
a changing dielectric environment. The 
metal thickness is also critical for SPR 
phenomena, and affects the biosensor’s 
efficiency; for example, with an excitation
wavelength of 700 nm the field enhancement
is normally maximized when the
gold layer is about 50 nm thick, but is decreased
as the gold layer becomes either
thicker or thinner.

SP coherence length implies that the
field intensity of SPR decays with a
characteristic distance $1/(2k_x'')$, where the
metal’s dielectric constant is complex
and the complex propagation constant
is $k_x = k_x' + jk_x''$ (real part $k_x'$, imaginary
part $k_x''$). For gold or silver (the most frequently
used metals in sensory applications),
the imaginary part of the dielectric
constant increases with decreasing wave- 
length, and the SP propagation length 
decreases accordingly. Hence, the SP prop- 
agation length will become longer with 
increasing wavelengths used for the sen- 
sor studies. Finally, phase jump refers to 
the reflection event at an interface, that is 
generally accompanied by a phase jump 
of the reflected field. The phase of the re- 
flected electric field undergoes a relatively 
large change around the SPR dip, and this 
is critical for sensing purposes. However, 
the absolute values found are of limited 
validity due to the complicated experimen- 
tal set-up, though a phase measurement 
will provide an order-of-magnitude better 
sensitivity. 
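
As a rough numerical illustration of the coherence (propagation) length 1/(2k_x''), the sketch below uses the standard surface plasmon dispersion relation for a single flat metal/dielectric interface, k_x = (2π/λ)·sqrt(ε_m·ε_d / (ε_m + ε_d)). The dielectric constant used for gold near 633 nm (−11.6 + 1.2j) and the value for water (1.77) are approximate, assumed inputs for illustration, and the function name is hypothetical.

```python
import cmath
import math


def sp_propagation_length_m(wavelength_m, eps_metal, eps_dielectric):
    """Intensity decay length 1 / (2 * Im(k_x)) of a surface plasmon.

    Uses k_x = (2*pi/lambda) * sqrt(eps_m * eps_d / (eps_m + eps_d)),
    the dispersion relation for a single flat metal/dielectric interface.
    """
    k0 = 2.0 * math.pi / wavelength_m
    kx = k0 * cmath.sqrt(eps_metal * eps_dielectric
                         / (eps_metal + eps_dielectric))
    return 1.0 / (2.0 * abs(kx.imag))


# Assumed, approximate optical constants (illustration only).
EPS_GOLD_633NM = complex(-11.6, 1.2)
EPS_WATER = 1.77

length_m = sp_propagation_length_m(633e-9, EPS_GOLD_633NM, EPS_WATER)
print(f"SP propagation length ~ {length_m * 1e6:.1f} micrometers")
```

A larger imaginary part of the metal's dielectric constant increases k_x'' and therefore shortens the propagation length, which is the trend described above for shorter wavelengths.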


4.2.2 Surface Chemistry in SPR Technique 
Detection processes in SPR between a 
target and a receptor are critical to provide 
a thorough understanding of all processes 
in living organisms [60]. Fast, selective 
and quantitative analyses, without a need 
for labels for optical biosensors, is the 
key to this situation. However,
direct detection of label-free
targets on bare SPR substrates suffers from
nonspecific binding of other components,
as well as the desired specific targets
of a biomolecular interaction. Another
issue when using SPR detection without 
a relevant surface modification is the 
irreversible binding of numerous proteins 
and other biomolecules, as this can result 
in a failure to completely regenerate chip 
surfaces for reuse. In fact, after being used 
only a couple of times, the SPR substrates 
will be only partially regenerated, and 
90% or more of their activity will be 
lost. Thus, it is clear that the surface 
energy and surface charge of the SPR 
chips must both be carefully modulated 
to obtain not only a minimum false signal 
but also maximum detection yields. In 
addition to a high signal-to-noise ratio, the 
ability to retrieve the biological activity of 
an immobilized ligand is essential when 
creating successful SPR-based sensors. 
The transport of targets to the SPR chip 
surface by convection and diffusion has a 
profound effect on the signal. A correctly 
selected surface nanoarchitecture is ex- 
tremely important to control the amount of 
immobilized ligand. The steric hindrance 
of target binding sites via a chemical im- 
mobilization process has a strong effect 
on the affinity of the ligand towards the 
target molecules. A sufficient spacing be- 
tween the ligands can help to minimize 
any steric problems by controlling the 
ligand densities. The binding rate and 


equilibrium constants can normally be de- 
termined based on a novel distribution 
analysis. A detailed characterization of the 
distribution of binding properties provides 
a useful tool for optimizing surface modi- 
fications to achieve an effective functional- 
ization of biosensor surfaces with uniform 
high-affinity binding sites, and also for 
studying immobilization processes and 
surface properties. The surface charge may 
also influence the interaction kinetics and 
the extent of nonspecific interactions. 
Controlling the hydrophilicity through
the appropriate selection of a functional
group can protect sensitive biomolecular
ligands; the most popular functional
group is the carboxylate ion. Nonspecific
binding can be prevented by
introducing a bioinert layer.
The optimum thickness
of the bioinert layer is based on the 
exponentially decaying strength of the 
evanescent field. A thickness >10 nm
would reduce the sensitivity of the binding 
signal, whereas a thickness <1 nm would 
usually cause an inhomogeneous coating 
of the metal surface. Hence, the preferred 
thickness of the bioinert layer, as an 
adhesion-linking layer, should be 2-5 nm. 


4.2.3 Surface Plasmon Fluorescence
Technique 

Among the various sensing principles 
proposed for biosensor studies, surface 
plasmon fluorescence spectroscopy has, 
in particular, found widespread applica- 
tion and has demonstrated its potential for 
the sensitive detection of targets in several 
examples. SPR provides a label-free de- 
tection principle, as only the presence of 
bound analytes will slightly alter the optical 
architecture at the sensor surface probed 
by the surface plasmon mode propagat- 
ing along this metal/dielectric interface. 
Another reason for such rapid growth is
that surface plasmon fluorescence-based
detection principles present attractive sen- 
sitivities for the in situ and real-time mon- 
itoring of biological targets. In addition, 
several facile surface-modification proto- 
cols are available to obtain the required 
functionalization of the sensor surface for 
label-free detections. 

As the intensity profile normal to the
metal/dielectric interface decreases
exponentially with distance from the
interface, the analyte
molecules must be brought as close to 
the metal surface as possible in order to 
place their chromophores into the high- 
est possible optical field. Metal-enhanced 
fluorescence is wavelength- as well as 
environment-dependent; notably, the en- 
vironment (e.g., the solvent) can affect the 
enhancement factors [67]. Other effects on 
chromophores in the excited state close to 
a metal surface such as gold, and which 
can lead to a quenching of fluorescence, 
should also be considered. The fluorescence
decreases at very short distances from the
metal, most likely due to FRET.
At intermediate distances,
however, an efficient back-coupling of 
the excitation energy from the vibrationally
relaxed excited state of the chromophore
to the metal substrate becomes the driv- 
ing force for the excitation of a red-shifted 
plasmon mode that can re-radiate via the 
prism at its respective resonance angle; 
this effect can be used to enhance the
fluorescence emission. Chromophores that are
sufficiently separated from the substrate
surface emit fluorescence directly, and this
emission can still be enhanced within the enhanced
optical field of the surface plasmon mode. This
combination of field enhancement and flu- 
orescence detection has been applied to a 
range of chemical and biosensing studies 
[68-80].


4.3
Surface-Enhanced Raman 
Spectroscopy-Based Biosensors 


Surface-enhanced Raman spectroscopy
(SERS), or surface-enhanced Raman
scattering, is a surface-sensitive technique that
enhances Raman scattering by molecules
adsorbed on rough metal surfaces [81].
Since the discovery in 1974 by Fleis- 
chmann et al., that a high-intensity Raman 
scattering of small molecules could be 
achieved on an electrochemically rough- 
ened silver surface, the field of SERS has 
expanded dramatically due to improve- 
ments in technique that have resulted 
from advances in nanotechnology and 
improved instrumental capabilities [82]. 
Today, the SERS technique is becoming 
widespread and is encountering new and 
exciting horizons in analytical chemistry, 
biology and biotechnology, forensic sci- 
ence, and also in the study of artistic 
objects. Although the exact mechanism 
of SERS remains a matter of debate, and 
the mechanisms proposed experimentally 
have not been straightforward, two pri- 
mary theories have persisted: (i) an elec- 
tromagnetic theory based on the excitation 
of localized surface plasmons; and (ii) a 
chemical theory based on the formation 
of charge-transfer complexes [83-88]. As 
the chemical theory applies only to species 
which have formed a chemical bond with 
the surface, it cannot explain the observed 
signal enhancement in all cases. In con- 
trast, the electromagnetic theory can be 
applied even to those cases where the 
specimen is only physically adsorbed on
the surface. Although the electromagnetic 
theory of enhancement can be applied re- 
gardless of the molecule being studied, 
it does not fully account for the magnitude
of the enhancement observed in many 
molecules which have lone pair electrons 


and are bound to the surface. In this 
situation, the enhancement mechanism 
cannot be solely explained by involving 
surface plasmons. The chemical mech- 
anism involves charge transfer between 
the chemically adsorbed species and the 
metal surface; in this case a spectroscopic 
transition — which takes place in the ul- 
traviolet range and where the metal acts 
as a charge-transfer intermediate — can be 
excited by visible light. 

An increase in the Raman signal on 
metal surfaces occurs due to enhance- 
ments in the electric field provided by the 
surface (Fig. 10). The light incident on the 
surface can excite a variety of phenomena 
in the surface, but the complexity of this 
condition can be simplified by surfaces 
with features much smaller than the
wavelength of the light. One useful
approximation for calculating the
enhancement, which has
been widely used in the literature, is the
electrostatic approximation. In this case, the
problem can be solved as in electrostatics, 
and the approximation corresponds then 
to ignoring the presence of the wave vector 
k. Therefore, the applied electric field does 
not have a wavelength; rather, it is a uni- 
form field oscillating up and down with 
frequency. Although this approximation 
fails in many cases, it is not too difficult to 
imagine that the electrostatic approxima- 
tion functions well when the size of the ob- 
ject is much smaller than the wavelength. 
This means that the electrostatic approxi- 
mation will be valid mostly for objects of 
typical sizes in the range of about 10 nm or 
smaller. Another factor that affects the in- 
tensity in enhancement is the shape of the 
features. Objects with different shapes will 
have different resonances, and more than 
one resonance condition associated with 
a given shape. The local field intensity

Fig. 10 Surface-enhanced Raman spectroscopy (SERS)-based biosensor.


enhancement factor at two different wave- 
lengths is strongly position-dependent in 
most cases, and the direction of the polar- 
ization is vertical. Intensity enhancement 
in more complicated shapes than in the 
simplest cases of a cylinder or sphere can 
also be very high in some circumstances. 

In the electrostatic approximation for the
calculation of field enhancement, size is
not really important, because the local
field intensity enhancement factor
predicted by the approximation does not
depend on size.
However, size is important if the objects
have typical dimensions
of ~30 to 100 nm. Generally, the localized
SPR red-shifts as the size increases.
Localized SPRs are also strongly damped as the
size increases, mostly as a result of
increased radiation losses. This results in 
a broadening of the resonance, such that 
the latter will eventually disappear for large 
sizes (typically 100 nm for dipolar localized 
surface plasmon in sphere, but possibly 
larger sizes for other geometries). 
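
The size independence of the electrostatic approximation can be seen in a short numerical sketch. It assumes the standard dipolar result for a small sphere, in which the local field at the pole along the polarization axis is $E_0(1 + 2\beta)$ with $\beta = (\varepsilon - \varepsilon_m)/(\varepsilon + 2\varepsilon_m)$. This formula is not given in the text, and the dielectric constants used below are hypothetical values chosen only to contrast on- and off-resonance behavior.

```python
def sphere_field_enhancement(eps_metal, eps_medium=1.0):
    """Peak local field intensity enhancement |E_loc/E_0|^2 at the surface
    of a small sphere in the electrostatic (quasi-static) approximation.

    Uses E_loc = E_0 * (1 + 2*beta) at the pole along the polarization
    axis, with beta = (eps_metal - eps_medium) / (eps_metal + 2*eps_medium).
    The sphere radius does not appear, so the result is size-independent.
    """
    beta = (eps_metal - eps_medium) / (eps_metal + 2.0 * eps_medium)
    return abs(1.0 + 2.0 * beta) ** 2


# Hypothetical dielectric constants (not measured data for any metal).
on_resonance = sphere_field_enhancement(complex(-2.0, 0.3))   # Re(eps) = -2 * eps_medium
off_resonance = sphere_field_enhancement(complex(-10.0, 0.3))

print(f"on resonance:  {on_resonance:.0f}")
print(f"off resonance: {off_resonance:.1f}")
```

Because no length scale enters the function, any size dependence (red shift, radiation damping) has to come from corrections beyond this approximation, as the paragraph above describes.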

The choice of metal in SERS experi- 
ments is also critical for improving the 


results obtained. Generally, it is clear that 
silver outperforms gold, an advantage that 
can be tracked down to the higher ab- 
sorption of gold at the frequencies where 
resonance occurs. However, the red shift
induced by object interaction and shape
and size effects can push the resonance in
gold to longer wavelengths (>600 nm).
In this case, gold may be as good as
silver, especially for bioapplications. Many
biological applications of the technique
are based on near-infrared lasers (typical
examples being diode lasers at ~750 or
~830 nm), for which gold will probably be the
most preferred plasmonic substrate.


5 
Piezoelectric Biosensors 


Piezoelectric biosensors are sensing de- 
vices which couple the bioaffinity recog- 
nition processes between the biological 
probe molecule and the target analyte 
molecule with the acoustic wave-based 
transduction mechanism, better known as 
the piezoelectric effect. The demonstration
of a linear relationship between the mass
adsorbed onto the surface of the piezoelec- 
tric crystal and its resonant frequency [89] 
and the development of suitable oscillator 
circuits for their operation in liquid [90], 
led to their application in the field of bi- 
ological sensing. Piezoelectric crystals can 
be combined with interfacial chemistry 
for the immobilization of biorecognition 
elements and macro- and micro-fluidic 
systems which enable a controlled contact 
of the analyte solution with the receptors 
[91], so as to yield efficient biosensing 
devices. 

Certain solid materials (especially crys- 
tals lacking a center of symmetry) that 
demonstrate charge accumulation upon 
the application of mechanical stress or in- 
ternal mechanical strain when subjected 
to an external electric field are said to
exhibit the ‘piezoelectric effect.’ In general,
a mechanical stress that originates from a 
change in the mass of the adsorbed film 
on the piezoelectric crystal changes the 
resonance frequency of the crystal; this 
relationship is given by:

$$\Delta f = -\frac{2 f_0^{2}\, \Delta m}{A \sqrt{\rho \mu}}$$

where $\Delta f$ is the change in resonant
frequency (in Hz); $f_0$ is the resonant
frequency of the crystal (in Hz); $\Delta m$
is the change in mass (in g); $A$ is the
piezoelectrically active area of the crystal,
between the electrodes (in cm²); $\rho$ is the
density of the crystal (in g cm⁻³); and $\mu$
is the shear modulus of the crystal (in
g cm⁻¹ s⁻²).

Fig. 11 Schematic of a piezoelectric transducer.

The change $\Delta f$ of the resonant
frequency $f_0$ of the piezoelectric crystal is
directly proportional to the change in
mass, $\Delta m$. In addition, the density,
viscosity, elasticity, electric conductivity 
and dielectric constant of the sensing el- 
ement can also undergo changes and, in 
turn, affect the piezoelectric transducer 
[92]. For an acoustically thin film, the 
mass change affects the transducer re- 
sponse [93, 94], whereas for an acoustically 
thick film, the film’s viscous and elas- 
tic properties and geometric features also 
make significant contributions. Generally, 
the change in mass is central to the ap- 
plication of piezoelectric transducers in 
biosensors. However, there are instances 
when the ability of such transducers to 
quantify changes in shear modulus and 
viscosity has been exploited to fabricate ef- 
ficient biosensors to study lipids and mem- 
branes. Figure 11 shows a typical piezo- 
electric transducer; here, the biosensing 
layer with the immobilized bioreceptors 




can be fabricated over the transducer sur- 
face. Piezoelectric biosensors have been 
employed in the label-free detection of 
a wide array of analytes ranging from 
proteins, oligonucleotides and DNAs, anti- 
gens, small molecules to viruses and bac- 
teria. They have also been used widely 
to study protein-protein, protein-DNA,
protein-peptide, and peptide-peptide
interactions, as well as interactions of
carbohydrates with proteins, lipids, and other
carbohydrates. 
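
The proportionality between frequency shift and adsorbed mass can be made concrete with a small calculation. The sketch below assumes the standard Sauerbrey form, $\Delta f = -2 f_0^2 \Delta m / (A\sqrt{\rho\mu})$, with $f_0$ in Hz and all other quantities in CGS units, and uses commonly quoted literature constants for AT-cut quartz. The function name and the example load are illustrative assumptions, and the relation holds only for thin, rigid films.

```python
import math

# AT-cut quartz constants in CGS units (commonly quoted literature values).
QUARTZ_DENSITY = 2.648           # g cm^-3
QUARTZ_SHEAR_MODULUS = 2.947e11  # g cm^-1 s^-2


def sauerbrey_shift(f0_hz, delta_mass_g, area_cm2,
                    density=QUARTZ_DENSITY, shear_modulus=QUARTZ_SHEAR_MODULUS):
    """Frequency shift (Hz) for a thin, rigid film added to the crystal."""
    return (-2.0 * f0_hz ** 2 * delta_mass_g
            / (area_cm2 * math.sqrt(density * shear_modulus)))


# Hypothetical example: 1 microgram spread over 1 cm^2 of a 5 MHz crystal.
shift = sauerbrey_shift(f0_hz=5e6, delta_mass_g=1e-6, area_cm2=1.0)
print(f"{shift:.1f} Hz")
```

For this example the shift is roughly -57 Hz, which is why frequency counters with sub-hertz resolution can resolve nanogram-level mass changes.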

In general, for biosensor applications 
the piezoelectric material should be capa- 
ble of operation in the liquid media. For ef- 
ficient operation in the liquid medium, the 
acoustic waves must be either shear hor- 
izontally polarized, or their phase speed 
should be lower than the speed of sound 
propagation in the liquid [95]. The dielec- 
tric constant of the material should match 
that of the medium in which the device is 
to be used, in order to prevent a capacitive 
short-circuit of the electric field at the in- 
terdigital transducers (IDTs) [96]. Quartz, 
lithium niobate (LiNbO3), potassium
niobate (KNbO3), lithium tantalate (LiTaO3),
and langasite (lanthanum gallium silicate) 
are some of the most commonly used 
piezoelectric crystals for the fabrication of 
acoustic devices. Knowledge of the specific 
properties of these different types of acous- 
tic wave devices, such as the mechanical 
displacement of acoustic waves, the spatial 
distribution of mechanical and electrical 
fields, susceptibility to spurious coupling 
modes, and the sensitivity to temperature 
and pressure, is also important for the 
design of an efficient biosensor [92]. 

Depending on the acoustic wave-guiding 
mechanism, acoustic wave devices can 
be divided into three categories: (1) bulk 
acoustic wave (BAW) devices, in which 
the wave propagates unguided through 
the volume of the material; (2) surface
acoustic wave (SAW) devices, in which
the wave propagates, guided or unguided, 
along a single surface of the material; and 
(3) acoustic plate mode (APM) devices, in 
which the acoustic waves are guided by 
reflection from multiple surfaces [93]. The 
SAW and APM devices can be grouped to- 
gether as surface-generated acoustic wave 
(SGAW) devices. 


5.1
Bulk Acoustic Wave (BAW) Sensors 


The BAW sensors, better known as thick- 
ness shear mode (TSM) devices or quartz 
crystal microbalance (QCM) devices, have 
traditionally been the choice of transducers 
for biosensors [97]. In the QCM bulk wave 
devices, the acoustic wave travels unguided 
through the entire volume of the piezoelec- 
tric substrate, resulting in vibration of the 
complete substrate. The displacement is 
maximized at the surface of the crystal, 
which makes the devices sensitive to sur- 
face interactions [98]. A typical BAW device 
consists of a piezoelectric crystal sand- 
wiched between two electrodes that are 
generally produced by vapor-depositing Au
or Pt onto the crystal surfaces. An
electric field applied between the electrodes
results in the mechanical oscillation of a 
standing shear wave across the bulk of the 
crystal at its natural resonant frequency. 
The frequency of the vibration depends on 
the properties of the crystal (size, density, 
cut, and shear modulus), and also on the 
properties of the phases adjacent to it [99]. 
This frequency changes when the target 
analyte molecules become attached to the 
bioactive layer that has been immobilized 
on the piezoelectric substrate, and this con- 
stitutes the output signal from the sensing 
device pertaining to the analyte. The sen- 
sitivity of these devices is limited by the 
thickness of the piezoelectric crystal. For
higher sensitivity, a higher resonant
frequency is required; this can be achieved by
reducing the thickness of the crystal, but 
thinner crystals are more fragile and diffi- 
cult to handle. The QCM devices have been 
well investigated for the past 50 years, and 
have subsequently become a mature, com- 
mercially available, robust, and affordable 
technology [100, 101]. Typically, frequen- 
cies between 5 and 30 MHz are used. 
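
The trade-off between resonant frequency and crystal thickness can be estimated with a short sketch. It assumes the fundamental thickness-shear resonance condition $f_0 = v/(2t)$, with shear wave speed $v = \sqrt{\mu/\rho}$, which is not stated in the text, together with commonly quoted constants for AT-cut quartz. The function name is hypothetical.

```python
import math

QUARTZ_DENSITY = 2.648           # g cm^-3 (AT-cut quartz, literature value)
QUARTZ_SHEAR_MODULUS = 2.947e11  # g cm^-1 s^-2


def crystal_thickness_um(f0_hz):
    """Thickness (micrometres) for fundamental thickness-shear resonance,
    assuming f0 = v / (2 t) with shear wave speed v = sqrt(mu / rho)."""
    shear_speed_cm_s = math.sqrt(QUARTZ_SHEAR_MODULUS / QUARTZ_DENSITY)
    thickness_cm = shear_speed_cm_s / (2.0 * f0_hz)
    return thickness_cm * 1e4  # cm -> micrometres


for f0_mhz in (5, 10, 30):
    print(f"{f0_mhz:>2} MHz: {crystal_thickness_um(f0_mhz * 1e6):.0f} um")
```

Under these assumptions a 5 MHz crystal is about a third of a millimetre thick, whereas a 30 MHz crystal is only a few tens of micrometres thick, which illustrates why higher-frequency crystals are fragile.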

The BAW sensors can generally be 
operated in two ways. In the first
(‘dip-and-dry’) method, the reaction
between the analyte and the immobilized
biorecognition element takes place in the 
solution phase, while the analysis and 
quantification occur in the gas phase
[99]. The method involves measuring the 
vibrational frequency of the piezoelectric 
quartz crystal (PQC) before dipping the 
device in the analyte solution for a stip- 
ulated time. The device is then rinsed 
to remove any nonspecifically bound 
molecules, dried, and the vibrational fre- 
quency is measured again. Shons et al. 
[102] described the first PQC biosensor in 
1972, while Grande et al. [103] discussed 
some of the complications of dip-and-dry 
methods, such as solvent retention. Unfor- 
tunately, the dip-and-dry method does not 
provide any real-time analysis. The second 
method involves solution-phase sensing 
for which the contact cell is configured in 
a flow or batch mode and a peristaltic or 
syringe pump is used to introduce the test 
solution into the cell [99]. Solution-phase 
sensing allows for real-time analysis. 

Previously, BAW sensors have been 
used for the detection of viruses, bac- 
teria and other cells. Lee et al. [104] 
demonstrated sensitivity comparable to 
an enzyme-linked immunosorbent assay 
(ELISA) for the detection of bovine
ephemeral fever virus. The application of
QCM biosensors in microbiology can be 


categorized into three areas: the direct 
detection of a microbe or spore; the detec- 
tion of an associated antigen or toxin; and 
the study and characterization of biofilm 
formed by a microbe [91]. Biosensors for 
a large number of bacteria, and for the 
toxins produced by them, have been re- 
ported. QCM sensors have been used 
successfully to monitor and quantify some 
key processes in biofilm formation and 
colonization in real time. QCM sensors 
have also been used to determine proteins, 
small molecules such as drugs, hormones 
and pesticides, and nucleic acids. 


5.2 
Surface-Generated Acoustic Wave (SGAW) 
Sensors 


SAW and APM devices can be grouped 
together as SGAW devices as both involve 
the generation and detection of acoustic 
waves at the surface of the piezoelectric 
crystal by means of IDTs [94]. Since the
acoustic wave is confined to the surface 
of the crystals, these devices are not af- 
fected by the crystal thickness [98]. SAW 
devices operate at higher frequencies than 
BAW devices which, in principle, may lead 
to higher sensitivities because the acous- 
tic wave penetration depth in the adjacent 
media is reduced [105]. In a typical con- 
figuration, an electrical signal is converted 
at the input IDT into a polarized trans- 
verse acoustic wave traveling parallel to the 
substrate surface. The amplitude and/or 
velocity of the wave are affected by any cou- 
pling reaction at the surface. The output 
IDT at the opposite end picks up the acous- 
tic wave and converts it back to an electrical 
signal; any attenuation of the wave is then 
reflected in the output signal (Fig. 12). 

Fig. 12 Surface-generated acoustic wave
(SGAW) sensor set-up. The arrows at the top
indicate the flow of the liquid sample (1) in
which the sensor is immersed. The elements
of the SGAW biosensor are a piezoelectric
crystal (2), IDTs (3), the surface acoustic
wave (4), and immobilized antibodies (5)
corresponding to the analyte molecules (6) in
the sample. The driving electronics (7) operate
the SAW biosensor and generate changes in
the output signal (8) as the analyte binds to
the sensor surface. Reproduced with permission
from Ref. [107]; © 2008, Springer Science
and Business Media.

Depending on the piezoelectric substrate
material, the crystal cut, the positioning of
IDTs on the substrate, plate thickness and
wave guide mechanism, different
operational modes of SGAW such as shear
horizontal surface acoustic wave (SH-SAW),
surface transverse wave (STW), Love wave,
shear horizontal acoustic plate mode
(SH-APM), and layer-guided acoustic plate
mode (LG-APM) can be achieved [95]. For
a better understanding of these SGAW
modes, an excellent review is provided
in Ref. [95]. SGAW devices can be
manufactured using IC microfabrication
or complementary metal-oxide-semiconductor
(CMOS) techniques, which allows for the
integration of a signal processing unit in
the sensor architecture itself [95, 106].
The basic set-up of a SGAW-based
biosensor consists of a piezoelectric
transducer with an immobilized biospecific
layer coupled to a driving electronic
circuit and integrated with a sample flow
mechanism driven by a peristaltic or
syringe pump. Different SGAW techniques
have been used according to the sensitivity
and operational requirements. Love wave
sensors are the most sensitive, with an


operating range of 80-300 MHz and mass
sensitivity of 150-500 cm² g⁻¹ [108, 109].
When Gizeli et al. [110] reported the first
biosensor based on Love waves in 1992,
the device consisted of a quartz crystal
with a poly(methylmethacrylate) (PMMA)
wave guide layer and immunoglobulin
G (IgG) immobilized on the surface as
the probe. Among others, an STW device
is reported to have a sensitivity of
100-200 cm² g⁻¹ and an operating
frequency of 30-300 MHz [108, 109]. Respective
values for SH-APM are 20-50 cm² g⁻¹
[93, 108, 109] and 25-200 MHz, and for
LG-APM are 20-40 cm² g⁻¹ [108, 109] and
25-200 MHz.
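
To give a feel for what these sensitivity figures mean in practice, the sketch below converts them into expected frequency shifts. It assumes the common definition of mass sensitivity, $S_m = (\Delta f / f_0)/(\Delta m / A)$, which the text does not spell out. The function name, operating frequency and surface mass loading are hypothetical, and only magnitudes are computed.

```python
# Mass sensitivity ranges (cm^2 g^-1) quoted in the text for each SGAW mode.
MASS_SENSITIVITY_CM2_PER_G = {
    "Love wave": (150.0, 500.0),
    "STW": (100.0, 200.0),
    "SH-APM": (20.0, 50.0),
    "LG-APM": (20.0, 40.0),
}


def frequency_shift_range_hz(mode, f0_hz, surface_mass_g_per_cm2):
    """Magnitude of the expected frequency shift (low, high) in Hz,
    assuming S_m = (delta_f / f0) / (delta_m / A)."""
    low, high = MASS_SENSITIVITY_CM2_PER_G[mode]
    return (low * f0_hz * surface_mass_g_per_cm2,
            high * f0_hz * surface_mass_g_per_cm2)


# Hypothetical example: 1 ng cm^-2 bound to a 100 MHz Love wave device.
low_hz, high_hz = frequency_shift_range_hz("Love wave", 100e6, 1e-9)
print(f"{low_hz:.0f}-{high_hz:.0f} Hz")
```

The same call with "SH-APM" or "LG-APM" gives shifts roughly an order of magnitude smaller at the same frequency and loading, consistent with the ranking given above.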


6 
Thermal Biosensors 


Thermal biosensors function by moni- 
toring the change in temperature due 
to the enthalpy changes associated with 
any biochemical reaction, and as such
are independent of the optical or
electrochemical properties of the biocatalyst
(usually enzyme), substrate or product. 
The invention of the enzyme thermistor 
(ET), which couples flow injection analy- 
sis (FIA) with an immobilized biocatalyst 
and a heat-sensing element [111], led to a 
surge in investigations into the design and 
applications of such biosensors. Their ver- 
satility and superior operational stability 
make thermal biosensors useful for such 
diverse applications as clinical analysis, 
food analysis, industrial process monitor- 
ing and environmental monitoring. 

Any biochemical reaction is accompanied
by the evolution or absorption of
heat. The total heat evolved or absorbed
during the reaction determines the change
in temperature ($\Delta T$) of the system, which
can be expressed in terms of the enthalpy
change associated with the reaction and
the heat capacity of the system as:

$$\Delta T = \frac{-n_p \, \Delta H}{C_s} \qquad (3)$$

where $n_p$ is the total number of moles
of the product, $\Delta H$ is the molar enthalpy
change associated with the reaction and
$C_s$ is the total heat capacity of the system.
The enthalpy changes associated with
biochemical enzymatic reactions are
usually in the range of 10 to 200 kJ mol⁻¹,
and are adequate to determine substrate
concentrations at clinically interesting
levels [112]. The molar enthalpy changes
for some common enzyme-catalyzed
reactions are listed in Table 1. The total
enthalpy change is the sum of enthalpy
changes associated with individual
reactions. Thus, the measurement can
be improved by coimmobilizing two
enzymes (for example, oxidases with
catalases), by using a high-protonation
enthalpy buffer such as
tris(hydroxymethyl)aminomethane (TRIS)
in the case of
proton-producing biochemical reactions
[112], by using organic solvents which
have lower heat capacities than aqueous
solvents [113], or by the enzymatic
recycling of the substrate where the net
enthalpy change in each cycle adds to
the overall enthalpy change [114, 115].
An enthalpy change of 100 kJ mol⁻¹ is
sufficient for the detection of analyte
concentrations down to 5 µmol l⁻¹.
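
The size of the temperature signal at this detection limit can be estimated from the relation $\Delta T = -n_p \Delta H / C_s$. The sketch below is a rough, adiabatic estimate under stated assumptions: complete conversion of the analyte, a hypothetical 1 mL aqueous sample of about 1 g, and the approximate specific heat of water. The function name is illustrative.

```python
WATER_HEAT_CAPACITY_J_PER_G_K = 4.18  # approximate specific heat of water


def temperature_change_k(moles_product, molar_enthalpy_j_per_mol, heat_capacity_j_per_k):
    """Adiabatic temperature change: delta_T = -n_p * delta_H / C_s."""
    return -moles_product * molar_enthalpy_j_per_mol / heat_capacity_j_per_k


# Hypothetical sample: 1 mL of aqueous solution (about 1 g) containing
# 5 umol/L of analyte, fully converted, with delta_H = -100 kJ/mol.
volume_l = 1e-3
moles = 5e-6 * volume_l                               # mol of product formed
heat_capacity = 1.0 * WATER_HEAT_CAPACITY_J_PER_G_K   # J/K for about 1 g of water
delta_t = temperature_change_k(moles, -100e3, heat_capacity)
print(f"{delta_t * 1e3:.2f} mK")
```

The result is on the order of 0.1 mK, which indicates the temperature resolution a thermistor needs at this concentration and why the enthalpy-boosting strategies listed above are worthwhile.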


Tab. 1 Molar enthalpy changes for some common enzyme reactions [112].

Enzyme                  Substrate                   Enthalpy change (ΔH; kJ mol⁻¹)
Catalase                Hydrogen peroxide           100
Cholesterol oxidase     Cholesterol                 53
Glucose oxidase         Glucose                     80-100
Hexokinase              Glucose                     75*
Lactate dehydrogenase   Sodium pyruvate             62
NADH dehydrogenase      NADH                        225
β-Lactamase             Penicillin G                115*
Trypsin                 Benzoyl-L-arginine amide    29
Urease                  Urea                        61
Uricase                 Urate                       49

*In Tris buffer (protonation enthalpy: -47.5 kJ mol⁻¹).

Reproduced with permission from Ref. [112]; © 2012, Elsevier.

Fig. 13 Schematic of an enzyme thermistor. Reproduced with
permission from Ref. [112]; © 2012, Elsevier.


The conventional thermometric device 
consists of a working column with the 
enzyme immobilized on a supporting ma- 
trix, and a thermal transducer (usually a 
thermistor) placed in the vicinity of the col- 
umn. A schematic of a flow injection anal- 
ysis enzyme thermistor (FIA-ET), a com- 
monly used thermal biosensor, is shown 
in Fig. 13. The broad design includes an 
external jacket for insulation, a working 
column with immobilized enzymes, an 
indirect placement of the thermistor to 
prevent fouling, a heat exchanger prior 
to the working column to avoid temper- 
ature fluctuation, and a peristaltic pump 
to drive the buffer and analyte solution 
through the system. Nonspecificity is in- 
herent in calorimetry, since all enthalpy 
changes in the reaction contribute to the 
measurement. However, this problem can 
be overcome by having a split-flow arrange- 
ment in which the test solution passes 
through two different columns. Typically,
an active column is paired with a reference
column containing only the support matrix
or, in some cases, inactivated enzyme, in
order to minimize the effects of nonspecific
enthalpies arising from nonenzymatic
reactions [112]. On the other hand, the
fabrication of miniaturized thermomet- 
ric biosensors has become possible due 
to advances in the field of IC technology 
and the micromachining of liquid filters, 
microvalves and micropumps [113]. Minia- 
turized devices are suitable for portable 
use because of their high sensitivity, small 
size, modest buffer consumption, and 
good operational stability [112, 116-119].
Enzymes are the most commonly used 
biorecognition element in a thermal 
biosensor; however, in cases where the 
isolation of a pure enzyme is not pos- 
sible, whole cells, organelles or tissue 
slices present a good alternative, although 
they may lack specificity and may also 
respond to some interfering compounds; 
MIPs may represent an alternative choice 
of receptor. Correct immobilization of 
the biorecognition element on the sup- 
port matrix is important to maintain a
good catalytic activity of the former. The
supporting matrix chosen should also be 
mechanically stable to withstand physi- 
cal stress, to allow good flow properties 
[112], and should not interfere with the 
enzymatic reactions. Controlled pore glass 
(CPG), Sepharose CL-6B or CL-4B, Euper- 
git [112], reticulated vitreous carbon (for 
hybrid thermal-electrochemical sensors) 
[115, 120-122] and ceramic hydroxyap- 
atite [123] are the most commonly used 
matrix materials. In general, a large excess 
of enzyme is immobilized on the sup- 
port matrix to ensure correct operational 
stability. 

Enzyme thermistors (ETs) have been 
used to determine a wide range of an- 
alytes, such as ethanol, glucose, oxalate, 
ascorbate, cellobiose and sucrose, and 
penicillin. Thermal biosensors have been 
used for the selective measurement of 
fructose in the presence of glucose [124] 
and to determine levels of urea in adul- 
terated milk [125]. ETs have also been 
used for clinical monitoring; for example, 
a semi-continuous blood glucose moni- 
toring ET device [126] and a cholesterol 
ET sensor [127] have been described. ETs 
have also been used for off-line as well as 
on-line monitoring of bioprocesses such as 
fermentation [128-130]. Heavy-metal ions 
such as Hg²⁺, Cu²⁺ and Ag⁺ can be
monitored using ETs via their inhibitory actions
on urease, while pesticides can be
determined by their inhibitory actions on the
enzymes acetylcholinesterase and
butyrylcholinesterase. Apoenzyme-based ETs can
be used to monitor heavy-metal ions up
to submillimolar levels [112]; an example
is that of Cu²⁺ concentrations in human
blood sera, which were measured using 
immobilized ascorbate oxidase [131] or 
galactose oxidase [132]. 

MIP-based thermistors have been used 
for the label-free characterization of MIP 


binding and catalysis [133, 134]. Thermo- 
metric transduction has also been cou- 
pled to an ELISA to yield a thermomet- 
ric enzyme-linked immunosorbent assay 
(TELISA) [135] that can be used to de- 
termine the presence of hormones, anti- 
bodies and other biomolecules in complex 
matrices such as fermentation broth, blood 
samples and hybridoma cell media [112]. 
Although the sample capacity and
sensitivity of TELISA are lower than those of
other established techniques (such as
radioimmunoassay), it offers faster monitoring
and can be employed where rapid results
are desired.


7 
Microarrays 


A microarray is a high-throughput,
two-dimensional screening array that
is located on a glass slide or a silicon
thin-film cell and can be used to assay large
quantities of biological materials. The
concept and methodology of microarrays
were introduced in 1983; since then, DNA,
protein, peptide, tissue, cellular,
carbohydrate and even phenotype
microarray technologies have become
highly sophisticated and are now used
worldwide [136].


7.1 
DNA Microarray 


A microarray usually contains picomoles 
of a specific sequence as probes, and this 
enables many genetic tests to be executed 
in parallel. As such, DNA
microarrays have dramatically accelerated 
many types of investigation [137-143]. The 
core principle of DNA (or RNA) microar- 
rays is based on a hybridization between 


Fig. 14 DNA microarray.


complementary base pairs through hydrogen
bonds; the total number of fluorescently
labeled target sequences that
bind to a probe sequence depends on
the amount of target sample and thus
provides quantitative information relating to
the target (Fig. 14). The amount of target
sample available for detection is generally limited,
however, and additional steps such as the
polymerase chain reaction (PCR) and
target labeling with fluorescent dyes can be
burdensome. Consequently, much
effort has been made to increase the de- 
tection signal by using highly fluorescent 
probing materials such as CPs or inorganic 
quantum dots, and to develop label-free de- 
tection methods in microarrays in order to 
avoid a cumbersome labeling step [55, 56, 
144-150]. 

Whilst a “traditional” solid-state array 
involves a collection of orderly microscopic 
spots called features, each with thousands 
of probes attached onto a surface, an 
“alternative” bead array is a collection 
of microscopic polystyrene beads, each 
with a specific probe and a mixture of 
two or more dyes which do not interfere 
with the fluorescence of the dyes used 
on the target sequence. Depending on the 
number of probes, the types of scientific 
questions being asked and the cost, DNA 
microarrays can be manufactured in different 
ways. Some of the most commonly used 
techniques for DNA microarray manufacture 
include printing with fine-pointed 
pins onto glass slides, photolithography 
using pre-prepared masks or dynamic 
micro-mirror devices, ink-jet printing, and 
electrochemistry on microelectrode arrays. 
DNA probes can also be synthesized di- 
rectly onto a microarray substrate (in situ) 
or attached (spotted) via surface engineering 
by covalent bonding, using silane, 
lysine, or amide chemistry. 

Applications of DNA microarrays 
include gene expression profiling, comparative 
genomic hybridization, GeneID, 
chromatin immunoprecipitation on chip, 
DamID, single nucleotide polymorphism 
(SNP) detection, exon (junction) arrays, 
fusion gene microarrays, and tiling 
arrays. 


7.2 
Protein Microarray 


A protein microarray (protein chip) is a 
high-throughput method used to track the 
interaction of large numbers of proteins 
in parallel, and to determine their func- 
tion [151-157]. One critical disadvantage 
of DNA microarrays lies in the fact that 
the quantity of mRNA in the cell often 
does not reflect the expression level of 
the corresponding proteins because proteins, 
unlike DNA or mRNA, are functional 
in cell response. Protein microarrays 
have enabled research groups to study the 
biological interactions at the cell level. The 
protein technology was relatively easy to 
develop as it is based on previously devel- 
oped DNA microarray technology. Similar 
to the DNA microarray, the chip consists of 
a support surface such as a glass slide, ni- 
trocellulose membrane or bead, while the 
probe molecules are typically labeled with 
fluorescent dyes. Nowadays, protein mi- 
croarrays have replaced cumbersome tech- 
niques such as two-dimensional gel elec- 
trophoresis or chromatography, which are 
not suited to the analysis of low-abundance 
proteins and are time-consuming and 
costly. Protein microarrays can be applied 
to proteomics, protein functional 
analysis, antibody characterization, the 
development of treatments such as 
antigen-specific therapies for autoimmunity, 
cancer and allergies, diagnostics such 
as tests for antigen-antibody interaction, 
the discovery of new biomarkers, and the 
monitoring of disease states. 

The surfaces of protein microarrays 
must meet the sophisticated requirements 
of immobilizing protein probes, notably to 
prevent protein denaturation and to provide 
a relevant surface polarity at which 
the binding reaction can occur. There 
is also a need to prevent the nonspe- 
cific binding of other proteins, and to 
minimize the creation of false signals 
from the background noise. Immobiliz- 
ing agents vary from layers of inorganic 
aluminum or gold to organic polymers, 
polyacrylamide gels, or small functional 
moieties such as amines, aldehyde and 
epoxy. Occasionally, thin-film technolo- 
gies such as physical vapor deposition 
(PVD) and chemical vapor deposition 
(CVD) are also used to apply the coating to 
the support surface. Protein array methods 
include ink-jetting, robotic spotting, 
piezoelectric spotting, drop-on-demand, and 
photolithography [158-162]. The probe 
molecules may be antigens, antibodies, 
aptamers, protein-mimicking peptides, or 
full-length proteins. Recently, an in-situ, 
on-chip synthesis of proteins directly from 
DNA, using cell-free expression systems 
called DAPA (DNA array to protein array), 
PISA (protein in situ array) or NAPPA 
(nucleic acid programmable protein array), 
was introduced because proteins on 
an array surface are highly sensitive and 
easily deteriorate, whereas DNA molecules 
are more stable over time and better suited 
to long-term storage. 

The detection methods employed include 
fluorescence labeling, as well as 
affinity, photochemical or radioisotope tag- 
ging. For label-free detection, SPR, carbon 
nanowire sensors (where detection occurs 
via changes in electronic conductance) and 
microelectromechanical system (MEMS) 
cantilevers can be used. However, these 
systems are ill-suited for high-throughput 
screening and need to undergo further 
development before their future use. 


8 
Conclusions 


During recent years the field of biosensors 
has undergone an evolutionary phase with 
ever-increasing demands for efficient, sen- 
sitive and robust sensors in the fields of 
clinical diagnostics, medicine and drugs, 
process control, and environmental monitoring. 
In this chapter, attention has been 
focused on the most basic principles of 
biosensor science, which will hopefully 
prove valuable to the reader. Electrochemical 
biosensors are the most widely 
used sensing devices, given their high 
efficiency and ease of operation. Optical 
techniques such as SPR and SER have 
also found widespread use in research 
and development, while biosensors based 
on conducting polymers are beginning to 
open up a new field of hand-held biosen- 
sors that involve internal transduction 
mechanisms and can respond to analyte 
recognition through color changes. While 
acoustic resonance devices seem to be a 
good choice for bioaffinity sensors, with 
some noteworthy advances having been 
made in this field, thermal biosensors do 
not yet appear to have made any serious 
practical impact. 

During the past decade, the devel- 
opment of biosensors has been greatly 
spurred by advancements made in the 
materials sciences. For example, nanos- 
tructured metal oxides have been shown 
to provide an effective immobilization of 
biomolecules with desired orientation and 
conformation, resulting in better sens- 
ing characteristics [163]. While silver and 
gold nanoparticles are widely used as elec- 
trochemical labels in amperometric im- 
munoassays, metal quantum dots have 
been used as multilabels for affinity reac- 
tions [4]. Carbon nanotubes (CNTs) have 
also proved to be a material of choice 
for electrode fabrication, due to their 
semiconductive behavior and high poros- 
ity [164]. Indeed, amperometric biosen- 
sors comprising a CNT-modified elec- 
trode have shown an enhanced reactiv- 
ity of NADH and hydrogen peroxide at 
the electrode [4]. Graphene, with its one 
atom-thick single graphitic layer, has at- 
tracted much scientific interest due to its 
unique physico-chemical properties such 
as high surface area and excellent ther- 
mal and electric conductivities. Given its 
excellent electron transport properties and 
high surface area, functionalized graphene 
is expected to help in the direct electron 
transfer between the electrode substrate 
and enzymes, and thus aid in the design 
of mediator-free biosensors with poten- 
tially better sensing parameters. Whilst 
conducting polymers have been used suc- 
cessfully as a material for the immobiliza- 
tion of biomolecules, as well as provid- 
ing enhanced electron transfer properties, 
they have also been shown to function 
as stand-alone sensors as their emission 
properties are influenced by their molecu- 
lar environment. 

With the principles of biosensing and 
transduction mechanism having been well 
established, attention in the field of biosen- 
sors has now been focused on miniaturiza- 
tion, and this has resulted in smaller, more 
sensitive, and more easily affordable de- 
vices. Microchip technology has helped to 
concentrate electronic circuits onto a sin- 
gle chip through embedded ICs. However, 
in spite of the technological innovations 
and improvements, the miniaturization of 
these devices poses technical challenges 
as they lack sensitivity, long-term stability 
and robustness for their intended appli- 
cations. Nonetheless, newer technologies 
such as silicon microsensors, fiber-optic 
biosensors and cell-on-chip sensors are 
increasingly being investigated [165]. 

Further developments in the field of 
biosensors will most likely be effected by 
the emergence of personalized medicine, 
as escalating healthcare costs continue to 
force the development of a new generation 
of wearable, integrated and less-invasive 
sensors [4]. In addition to healthcare and 
clinical diagnostics, industrial processes 
and environmental monitoring will also 
continue to press for more efficient, sen- 
sitive and robust biosensors. Moreover, 
with the increasing risks of biological and 
chemical warfare, security and biodefense 
will also require new, innovative and effi- 
cient biosensors to meet these challenges. 


Keywords 


Synthetic biology 

An area of biological research combining biology and engineering in the design and 
construction of novel biological parts, devices, and systems, including the redesign of 
existing natural biological systems for useful purposes. 


Vitamin B12 
Also known as cobalamin, this is a water-soluble vitamin that is synthesized exclusively by 
some bacteria and archaea and is required by humans for a variety of metabolic processes. 


Substrate channeling 
The process in which an intermediate is transferred between the active sites of two 
enzymes that catalyze sequential reactions, without its release into solution. 


Cobalamin biosynthesis methyltransferases 
Enzymes that transfer a methyl group from S-adenosyl methionine (SAM) to an 
intermediate in the cobalamin biosynthetic pathway. 


Bacterial microcompartments 
Cellular compartments composed of a proteinaceous shell that is associated with a 
particular metabolic pathway. 


Synthetic Biology in Metabolic Engineering 


Vitamin B12 (cobalamin) is a remarkable nutrient not only because of its structural 
complexity but also because it is only synthesized by certain bacteria. In order to 
understand its biosynthesis and to enhance its production, metabolic engineering 
and synthetic biology strategies have been applied to elucidate this metabolic process. 
The results have shown that intermediates in the pathway are passed from 
one enzyme to the next by substrate channeling. Knowledge of the pathway is also 
being used in the design of vitamin analogs that have potential as drug-delivery 
vehicles. Once its synthesis is complete, cobalamin is required as either a coenzyme 
or cofactor in a number of different metabolic processes. Some cobalamin-dependent 
enzymes are found encased within bacterial microcompartments, proteinaceous 
organelles that house a specific metabolic pathway. The potential to develop 
these supra-macromolecular structures into bespoke bioreactors by replacing the 
embedded pathway is discussed. 


1 
Industrial Production of Vitamin B12 


The cost efficiency of the synthesis of 
a commercially valuable small molecule 
is the key determinant for its indus- 
trial production. A chemical synthesis of 
a small molecule is often the preferred 
choice because of its speed, control, and 
ease of purification, although the yield 
must be high for exploitation. However, 
many natural products — including amino 
acids, antibiotics and nutrients — are ex- 
tracted from natural sources, especially 
if the desired bioproduct is too complex 
or is inefficient to synthesize chemically. 
Therefore, natural product synthesis of 
small molecules derived from primary 
and secondary metabolism represents a 
major research area in modern molecu- 
lar biotechnology and medicine. In this 
chapter, an exploration will be made of 
the biochemistry underpinning one of the 
most complicated pathways found in Na- 
ture, namely the biosynthesis of biological 
forms of vitamin B12, as well as investigations 
of the processes that Nature employs 
to enhance the synthesis of this nutrient 
through pathway channeling. How 
cobalamin itself is associated with 
compartmentalization within prokaryotic 
cells will also be explored. 

Vitamin B12 is the cure for pernicious 
anemia. The molecular structure, 
determined using X-ray crystallography, 
revealed the vitamin to be a modified 
cobalt-containing tetrapyrrole (Fig. 1) of 
which there are two major biological 
forms: (i) a cofactor form, methylcobal- 
amin, which is used in methyltransfer 
reactions; and (ii) a coenzyme form, 
adenosylcobalamin, which is involved in 
rearrangement processes. What makes 
cobalamin unique among all other vi- 
tamins is that it is synthesized solely 
by prokaryotes, with the biosynthesis be- 
ing restricted to certain members of the 
Eubacteria and the Archaea. Wild-type 
bacteria synthesize low levels of cobal- 
amin, which is used for endogenous 
B12-dependent enzymes or, alternatively, 
for symbiotic interactions with other or- 
ganisms [1]. The biosynthesis of cobal- 
amin involves one of the most complex 
pathways found in Nature. As a modified 
tetrapyrrole, adenosylcobalamin belongs 
to the same family of compounds as heme 
and chlorophyll, the so-called “pigments 
of life.” The synthesis of cobalamin, however, 
is more complicated than that of 

Fig. 1. The structure of adenosylcobalamin. The carbons are numbered in red, the pyrrole 
rings in maroon, and the side chains in green. 


these porphyrin derivatives because the 
tetrapyrrole must undergo a contraction 
process to produce a corrin nucleus. This 
also involves decoration of the ring pe- 
riphery with methyl groups, derived from 
S-adenosyl-L-methionine (SAM), cobalt insertion 
and amidations, as well as the 
attachment of upper and lower ligands 
for the metal ion. 

The chemical synthesis of vitamin B12 
requires approximately 70 chemical steps 
[2], and was achieved after a monumental 
struggle by the combined efforts of Wood- 
ward and Eschenmoser in 1973 [3]. This 
remains one of the great feats of synthetic 
chemistry because of the enormity of the 
complex synthesis, which required over 
a decade of planning and research and 
yielded only a few precious milligrams of 


the crystalline product. Needless to say, 
the chemical synthesis of cobalamin is not 
a commercial option! Nonetheless, annually, 
around 30 tons of vitamin B12 are 
produced by industry for vitamin supple- 
mentation and animal feed additives [4]. 
A variety of enhanced bacterial strains 
exist which are capable of producing vitamin 
B12 on a commercial scale, although a 
genetically modified Pseudomonas denitrif- 
icans strain is now the dominant variant. 
The major western producer of vitamin B12 
was Rhone-Poulenc Rorer (RPR), France 
which, after several buyouts/mergers, is 
now owned by Sanofi-Aventis. P. deni- 
trificans is an aerobe that is grown on 
glucose with high oxygenation [5]. When 
growth is completed, the culture is heated 
in the presence of cyanide, whereby the 


heat treatment breaks open the cells and 
precipitates the majority of the protein 
material, while cyanide provides an up- 
per axial ligand to the cobalamin, forming 
cyanocobalamin. The mixture is then clarified 
before vitamin B12 is extracted by 
chemical precipitation. At this stage of 
the purification the nutrient is suitable 
for a livestock feed additive. However, 
to achieve a higher purity a mixture of 
chromatographic techniques and chemical 
extraction is used to produce 99.9% pure 
vitamin B12, which is sold commercially for 
food and vitamin supplementation. 
Attention will now be focused on how 
P. denitrificans became the dominant 
commercial producer of vitamin B12. 
During the early 1980s, a strain named P. 
denitrificans SC510 reportedly produced 
50-100 mg of cobalamin per liter of cul- 
ture [4]; this production level was achieved 
after a decade of random mutagenesis and 
selection. Here, a good producing strain is 
used as the starting point and exposed to a 
mutagen; individual cells are then selected 
and screened for potential advantageous 
properties such as higher levels of desired 
bioproduct, faster growth on cheap sub- 
strates (sugar beet, glycerol), or increased 
biomass. Upon the selection of a lead 
strain, a new round of mutagenesis and 
screening is performed, and this process 
is repeated until a target is reached. 
Similar approaches have led to the optimization 
of clavulanic acid (β-lactamase 
inhibitor) production in Streptomyces 
clavuligerus, in a strain developed by 
GlaxoSmithKline during the early 1990s 
[6]. Unfortunately, random mutagenesis 
is time consuming, involving endless 
datasets and a host of mutated strains to 
store. While it is possible to gain some 
scientific understanding by mapping the 
mutations to a desired phenotype by DNA 
sequencing, this is impractical for 
commercial application. 
Nonetheless, recent developments in 
synthetic biology have led to this method 
being used for the fine-tuning of rationally 
designed synthetic components [7]. 
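
The repeated cycle of mutagenesis, screening and lead-strain selection described above can be sketched as a simple loop. This is a toy illustration only: the `mutagenize` function, the Gaussian effect size, the library size and the titers are hypothetical assumptions introduced here, not values from the strain-development programs discussed in the text.

```python
import random


def mutagenize(titer, rng):
    """Hypothetical mutagen effect: most mutations are slightly harmful, a few help."""
    return max(0.0, titer * rng.gauss(0.97, 0.10))


def improve_strain(start_titer, target_titer, library_size=200, max_rounds=50, seed=0):
    """Repeat mutagenesis and screening until the target titer or the round limit is reached.

    Returns the titer of the lead strain after each round (index 0 is the parent).
    """
    rng = random.Random(seed)
    lead = start_titer
    history = [lead]
    for _ in range(max_rounds):
        if lead >= target_titer:
            break
        # Expose the current lead strain to the mutagen and screen individual isolates.
        mutants = [mutagenize(lead, rng) for _ in range(library_size)]
        # Select the best isolate; keep the parent if no mutant outperforms it.
        lead = max(lead, max(mutants))
        history.append(lead)
    return history


history = improve_strain(start_titer=50.0, target_titer=100.0)
print(len(history) - 1, "rounds; final titer (arbitrary units):", round(history[-1], 1))
```

Because the parent strain is retained whenever no mutant outperforms it, the recorded titer never decreases between rounds. The sketch also makes the practical burden visible: every round generates a full library of strains that must be screened and stored.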

Returning to the commercial vitamin 
B12 story, a research group at RPR took 
on the development of the commercial 
P. denitrificans SC510 strain in the mid 
1980s, and were able to exploit the strain 
and unravel the biochemistry of vitamin 
B12 biosynthesis. In conjunction with academic 
groups at Cambridge and Texas 
A&M, they were able to elucidate 
the de-novo aerobic biosynthetic route for 
cobalamin synthesis [8] (Fig. 2). This was a 
significant achievement since at this time 
the biosynthetic steps were unknown and 
little was known of the molecular genetics 
of the organism, let alone its metabolic en- 
gineering. Moreover, it is now appreciated 
that many of the pathway intermediates 
are highly sensitive to molecular oxygen, 
which makes their handling and analysis 
technically challenging [9]. 

The RPR team was able to isolate the 
main cobalamin (cobI) operon in P. denitrificans, 
revealing that it contained eight 
genes (cobF-cobG-cobH-cobI-cobJ-cobK-cobL-cobM) 
that encoded enzymes 
responsible for the transformation of 
precorrin-2 into hydrogenobyrinic acid 
(HBA) [10]. The team then developed 
a P. denitrificans cobI knockout strain 
by homologous recombination that 
lacked these genes, therefore rendering 
it unable to produce cobalamin. When 
individual genes or combinations were 
reintroduced on a separate plasmid for 
overexpression into the P. denitrificans 
cobI knockout strain, the effects of these 
variant plasmids were tested by direct 
13C nuclear magnetic resonance (NMR) 
labeling experiments. This determined 

Fig. 2. Overview of the adenosylcobalamin biosynthetic pathways. The modern pathway, also called late cobalt 
insertion or aerobic, is highlighted in red. The ancestral pathway, also called early cobalt insertion or anaerobic, is 
highlighted in green. Black boxes denote the joint steps between the two pathways. 


the number of methylation events and 
their respective position on the macro- 
cycle, allowing each methylation activity 
to be assigned to a specific methyltransferase 
enzyme as follows: CobI → C20; 
CobJ → C17; CobM → C11; CobF → C1; 
and CobL → C5 and C15 [8]. In addition, 
other experiments verified the activities 
of the monooxygenase CobG [11], the re- 
ductase CobK [12], and the methylmutase 
CobH [13]. This allowed for the elucida- 
tion of the aerobic biosynthesis of HBA 
from precorrin-2 [8]. This initial research 
on corrin ring biosynthesis laid the foun- 
dation for further studies that provided 
a complete understanding of corrin ring 
synthesis through establishment of the 
cobalt chelation step, catalyzed by Cob- 
NST [14], the amidation enzymes CobB 
and CobQ [15], and the cobalt(II) reductase 
[16]. Based on this knowledge, the research 
groups were able to target the amplifica- 
tion of particular genes from within the 
biosynthetic process to enhance cobalamin 
biosynthesis. Most notably, a 30% gain in 
B12 yield was recorded upon increased levels 
of the cobI operon, while the combined 
overexpression of the cobA and cobE genes 
led to a further 20% increase [17]. More- 
over, additional characterization of the 
terminal steps of the nucleotide loop syn- 
thesis and attachment revealed that this 
was a rate-limiting step for commercial 
cobalamin production in P. denitrificans 
[17]. Strategies to overcome this limita- 
tion were investigated and resulted in the 
cloning of genes from Rhodobacter capsu- 
latus heterologously in P. denitrificans, to 
provide a more efficient system [17]. 
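
How the two reported gains combine depends on an assumption the text leaves open: whether the further 20% is measured relative to the already improved strain (multiplicative) or relative to the original strain (additive). A minimal sketch of both readings follows; the function names are hypothetical and the result is not a figure from reference [17].

```python
def compounded_gain(*fractional_gains):
    """Overall gain if each improvement is relative to the previous strain."""
    factor = 1.0
    for gain in fractional_gains:
        factor *= 1.0 + gain
    return factor - 1.0


def additive_gain(*fractional_gains):
    """Overall gain if every improvement is relative to the original strain."""
    return sum(fractional_gains)


# 30% from increased cobI operon levels, a further 20% from cobA and cobE.
print(round(compounded_gain(0.30, 0.20), 2))  # 0.56
print(round(additive_gain(0.30, 0.20), 2))    # 0.5
```

Under either reading the combined improvement is roughly one and a half times the starting yield, so the distinction matters mainly when many such steps are stacked.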

Due to the success of this approach, 
P. denitrificans currently dominates the 
B12 industrial market [4]. It has been reported 
that P. denitrificans SC510 produces 
approximately 200 mg of cobalamin per 
liter of culture under high cell density fermentation 
[18, 19], although it is rumored 
to reach almost double this level under in- 
dustrial fermentation conditions [4]. The 
story of cobalamin production provides a 
fascinating account of the early trials and 
application of synthetic biology, before the 
scientific concept was actually appreciated. 


2 
Modular Assembly of Cobalamin 


Cyanocobalamin (vitamin B12) is an unnatural 
variant of the active coenzyme form of 
cobalamin, adenosylcobalamin, produced 
by the chemical replacement of the upper 
adenosyl group with cyanide during the 
purification process. Adenosylcobalamin 
is often referred to as the most beautiful 
pigment of life [20], and its synthesis as 
the “Mount Everest of biosynthetic prob- 
lems” [21]. Overall, at least 30 enzymes 
are required for this synthesis de novo, 
which is restricted to a number of prokary- 
otes. For example, a bacterium such as 
Salmonella enterica requires an astonish- 
ing 0.6% of its genome for cobalamin 
biosynthesis (calculated from all known 
genes involved in the synthesis by com- 
parison to the whole genome). Yet, such a 
massive investment for a bacterial organ- 
ism indicates that the ability to produce 
this nutrient must provide a significant 
metabolic advantage. What has become 
clear is that Nature has evolved two simi- 
lar — but distinct — metabolic pathways for 
cobalamin biosynthesis (Fig. 2) which dif- 
fer in their requirements for oxygen and in 
the timing of cobalt insertion, and which 
are often referred to as the cobalt-late (aer- 
obic) or cobalt-early (anaerobic) pathways. 
In fact, by comparing bacterial genomes, 
the picture becomes more complicated 
as many bacterial strains seem to have 
evolved their own method of producing 
cobalamin, based on one of the two main 
“template” pathways. These two pathways 
are described in the following sections. 


2.1 
Cellular Resource Procurement 


The construction of cobalamin requires 
the cellular provision of a number of 
starting metabolites, cofactors, and metals 
(Fig. 3) such as glycine, succinyl-CoA, 
L-glutamate, pyridoxal phosphate (PLP) 
or pyridoxamine phosphate (PMP), 
oxygen, cobalt, L-glutamine, L-threonine, 
L-methionine, ATP, GTP, magnesium, 
FAD (flavin), β-nicotinic acid 
mononucleotide (NAMN), nicotinamide 
adenine dinucleotide reduced (NADH), 
nicotinamide adenine dinucleotide 
phosphate reduced (NADPH), erythrose, 
and formate. 


2.2 
Building Blocks 


In order to build one corrin macrocycle, 
eight molecules of 5-aminolevulinic acid 
(ALA) and eight molecules of SAM are 
required (Fig. 3). 


2.2.1 Building Block 1: ALA 

While most bacteria, including Escherichia 
coli, synthesize ALA from 
L-glutamate using the C5 pathway, some 
bacteria, especially those belonging to 
the α-Proteobacteria, including Brucella 
melitensis, Rhodobacter sphaeroides, and 
Sinorhizobium meliloti, follow the 
Shemin or C4 pathway where ALA is produced 
from glycine and succinyl-CoA. The 
C5 pathway requires three consecutive 
steps to create ALA. First, L-glutamate is 
converted to glutamyl-tRNA and reduced 
to L-glutamate 1-semialdehyde by the 
glutamyl-tRNA reductase HemA, using 
NADPH as a cofactor. ALA is then produced 
by the L-glutamate 1-semialdehyde 
aminotransferase HemL via an 
isomerization reaction, using PMP as 
a cofactor. In contrast, the C4 pathway 
is a one-step reaction catalyzed by ALA 
synthase (also named HemA), using PLP as a 
cofactor. 


2.2.2 Building Block 2: SAM 

SAM is a common cofactor that is involved 
in methyl group transfer reactions with 
SAM-dependent methyltransferases. It is 
synthesized from L-methionine and ATP 
in a reaction catalyzed by SAM synthetase 
(MetK). 


2.3 
Synthesis of the Tetrapyrrole Macrocycle, 
Uroporphyrinogen III (Uro’gen III) 


The transformation of ALA into uro’gen 
III is catalyzed by three enzymes: ALA 
dehydratase (HemB); porphobilinogen 
deaminase (HemC); and uro’gen III 
synthase (HemD). These three steps 
are found in all bacteria that produce 
tetrapyrroles. HemB catalyzes the 
condensation of two molecules of ALA to 
form the monopyrrole porphobilinogen (PBG). 
This homo-octameric enzyme normally 
requires a catalytic metal ion, either 
zinc or magnesium, to help promote 
the Knorr-type condensation reaction 
between the two amino ketone substrate 
molecules. The next enzyme, HemC, 
utilizes a unique dipyrromethane cofactor 
to initiate the polymerization process 
of four PBG pyrrole units to generate 
a linear tetrapyrrole termed either hy- 
droxymethylbilane or preuroporphyrinogen, 
with the release of four molecules of 
ammonia. This highly unstable bilane is 
passed to the next enzyme in the pathway, 
HemD, which catalyzes the cyclization of 
hydroxymethylbilane together with the 
inversion of its ring D to produce the type 
III isomer of uro’gen. 
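
The stoichiometry of these three steps can be checked with a short calculation. The sketch below uses only the ratios stated in the text (two ALA per porphobilinogen, four porphobilinogen units per bilane, four ammonia released per bilane); the function and constant names are hypothetical.

```python
ALA_PER_PBG = 2        # HemB condenses two ALA into one porphobilinogen
PBG_PER_BILANE = 4     # HemC polymerizes four porphobilinogen units
NH3_PER_BILANE = 4     # ammonia released during polymerization


def precursor_budget(n_macrocycles):
    """Return the precursor requirement for a given number of uro'gen III molecules."""
    pbg = n_macrocycles * PBG_PER_BILANE
    return {
        "ALA": pbg * ALA_PER_PBG,
        "porphobilinogen": pbg,
        "NH3_released": n_macrocycles * NH3_PER_BILANE,
        "urogen_III": n_macrocycles,  # HemD cyclizes each bilane into one macrocycle
    }


print(precursor_budget(1))
```

For a single macrocycle the ALA requirement comes to eight molecules, which agrees with the figure given for building one corrin macrocycle in Section 2.2.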


2.4 
Different Molecular Assemblies 


From uro’gen III to cobinamide, at least 15 
steps are required and two distinct routes 
can be followed, reflecting the evolution 
of the microorganisms and their habitats. 
As mentioned above, the main differ- 
ences between these two pathways relate 
to a requirement for molecular oxygen 
and the timing of cobalt insertion. Con- 
sequently, the aerobic pathway generally 
has metal-free intermediates, whereas the 
anaerobic pathway has cobalt-containing 
metabolites. 


2.4.1 Section 1: The Last Common Step; 
Uro’gen III to Precorrin-2 
Uro’gen III acts as the first branch- 
point in the biosynthesis of modified 
tetrapyrroles. To direct this intermediate 
towards the biosynthesis of siroheme and 
the alternative heme pathway, it under- 
goes a bis-methylation event where two 
SAM-derived methyl groups are added 
consecutively to positions 2 and 7 of 
the macrocycle. This reaction is catalyzed 
by an enzyme called SAM-dependent 
uro’gen III methyltransferase (SUMT), 
CysGA/CobA. The enzyme produces dihydrosirohydrochlorin, 
which is also known 
as precorrin-2. The intermediates associ- 
ated with the biosynthesis of the cor- 
rin ring are normally termed precorrin-n, 
where n refers to the number of methyl 
groups that have been added to the 
macrocycle. 

At this point, the two cobalamin biosyn- 
thetic pathways diverge and rejoin again 
only after a further 10 steps. The early 


cobalt insertion pathway is often referred 
to as the anaerobic pathway as it was orig- 
inally observed in anaerobes, whilst the 
late cobalt insertion pathway was termed 
aerobic as it was shown to require molec- 
ular oxygen. However, whilst the soil 
bacterium Bacillus megaterium can op- 
erate an early cobalt insertion pathway 
(anaerobic) in the presence of oxygen [22], 
the cobalt-late pathway (aerobic) does not 
seem to operate under anaerobic condi- 
tions. This suggests that the cobalt-early 
pathway was the first to evolve. For sim- 
plicity, the early cobalt insertion route will 
be named the ancestral pathway, while the 
late cobalt insertion route will be named 
the modern pathway. 


2.4.2 Section 2a: The Ancestral Pathway; 

Precorrin-2 to Cobyrinic Acid a,c-Diamide 

For the ancestral pathway, precorrin-2 
is acted upon by a precorrin-2 dehydrogenase 
(SirC) that uses NAD+ as 
cofactor to transform it into sirohydrochlo- 
rin through the removal of two protons 
and two electrons. This oxidized version 
of precorrin-2 is sometimes referred to 
as factor II which, in turn, is the sub- 
strate for the cobalt chelatase generat- 
ing cobalt-sirohydrochlorin (cobalt-factor 
II) [23]. Interestingly, a range of differ- 
ent cobaltochelatases are found associated 
with this pathway that, despite lacking sig- 
nificant amino acid sequence similarity, 
share the same overall tertiary structure. 
This structure is also similar to the fer- 
rochelatase (HemH) that is associated with 
heme biosynthesis [24, 25]. The different 
cobaltochelatases include CbiK, CbiXL, 
and CbiXS. Of these enzymes, CbiXS is the 
smallest and it is apparent that both CbiXL 
and CbiK have arisen through a gene duplication 
and fusion event linked with 
cbiXS. This would seem to suggest that 
CbiXS represents a primordial chelatase 
[26, 27]. 

Once cobalt has been inserted into 
the macrocycle, cobalt-factor II is sub- 
jected to a third methylation at posi- 
tion C20 to produce cobalt-factor III. 
This SAM-dependent reaction is catalyzed 
by CbiL [28]. Cobalt-factor III is then 
methylated at position C17 by CbiH, 
which introduces a delta lactone on 
ring A, and induces a contraction of 
the macrocycle that results in the pro- 
duction of cobalt-precorrin-4 [29]. The 
transformation of cobalt-precorrin-4 into 
cobalt-precorrin-5A is catalyzed by the 
C11 SAM methyltransferase, CbiF. The 
delta lactone ring of cobalt-precorrin-5A is 
opened by the action of CbiG, and the 
two-carbon unit, representing the origi- 
nal methylated C20 position, is released 
as acetaldehyde [30]. The formation of 
cobalt-precorrin-5B is followed by a further 
SAM-dependent methylation whereby the 
sixth methyl group is added to position 
C1, producing cobalt-precorrin-6A in a re- 
action catalyzed by CbiD [31]. 

Up until this point, all of the previous 
methyl transfers, at C2, C7, C20, C17, 
and C11, had been performed by enzymes 
that are structurally closely related and 
which belong to the class III methyltrans- 
ferase family. However, CbiD is not part 
of this family and, indeed, the structure of 
Archaeoglobus fulgidus CbiD (PDB 1SR8) 
indicates that the protein lacks similar- 
ity to any characterized enzyme in the 
database. In this respect, CbiD is a novel 
methyltransferase. 

Cobalt-precorrin-6A is reduced 
by the action of CbiJ to produce 
cobalt-precorrin-6B through the addition 
of two protons and two electrons [12]. 
Next, the multifunctional enzyme 
CbiET methylates cobalt-precorrin-6B at 
positions C5 and C15, and decarboxylates 
the acetate side chain attached to 
C12. This generates cobalt-precorrin-8 
and concludes the methylation decoration 
around the corrin ring [32]. In some 
microorganisms, such as S. enterica, 
CbiET is found as two enzymes, CbiE and 
CbiT [33]. While both enzymes catalyze 
a SAM-dependent methyl transfer, only 
CbiE is a class III methyltransferase. 
CbiT belongs to the class I methyltrans- 
ferases [34]. It is likely that CbiT catalyzes 
decarboxylation of the acetate side chain 
on C12 and methylates the C15 position 
to generate cobalt-precorrin-7, whereas 
CbiE methylates the C5 position [35]. 

The conversion of cobalt-precorrin-8 
into cobyrinic acid is catalyzed by the 
methyl isomerase CbiC, which moves the 
methyl group from C11 to C12 in a 
sigmatropic-type rearrangement reaction 
[13, 36]. Next, an enzyme (CbiA) amidates 
the acetic acid side chains attached to C2 
and C7, the a and c side chains, using 
L-glutamine as an amido donor in an 
ATP-dependent reaction, to give cobyrinic 
acid a,c-diamide [37, 38]. It is at this point 
that the two divergent corrin-biosynthetic 
pathways converge again. 
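
The ancestral steps described above can be written down as an ordered table, which makes it easy to ask which enzymes lie between any two intermediates. The following is only a minimal sketch: the enzyme and intermediate names come from the text, while the tuple layout and the helper name `enzymes_between` are illustrative assumptions, and CbiET is treated as a single step even though some organisms split it into CbiE and CbiT.

```python
# Ancestral (early cobalt insertion) steps as described in the text.
# Each entry: (enzyme, substrate, product, note). The layout is an assumption
# made for illustration, not a standard data format.
ANCESTRAL_STEPS = [
    ("CbiH", "cobalt-factor III", "cobalt-precorrin-4",
     "C17 methylation, ring contraction"),
    ("CbiF", "cobalt-precorrin-4", "cobalt-precorrin-5A",
     "C11 methylation"),
    ("CbiG", "cobalt-precorrin-5A", "cobalt-precorrin-5B",
     "lactone opening, acetaldehyde released"),
    ("CbiD", "cobalt-precorrin-5B", "cobalt-precorrin-6A",
     "C1 methylation"),
    ("CbiJ", "cobalt-precorrin-6A", "cobalt-precorrin-6B",
     "reduction"),
    ("CbiET", "cobalt-precorrin-6B", "cobalt-precorrin-8",
     "C5/C15 methylation, C12 acetate decarboxylation"),
    ("CbiC", "cobalt-precorrin-8", "cobyrinic acid",
     "methyl shift from C11 to C12"),
    ("CbiA", "cobyrinic acid", "cobyrinic acid a,c-diamide",
     "amidation of the a and c side chains"),
]


def enzymes_between(start, end, steps=ANCESTRAL_STEPS):
    """Return, in order, the enzymes that convert `start` into `end`."""
    by_substrate = {sub: (enz, prod) for enz, sub, prod, _ in steps}
    route, current = [], start
    while current != end:
        if current not in by_substrate:
            raise ValueError(f"no step consumes {current!r}")
        enzyme, current = by_substrate[current]
        route.append(enzyme)
    return route


print(enzymes_between("cobalt-factor III", "cobyrinic acid a,c-diamide"))
```

A linear lookup is sufficient here because each intermediate in this stretch of the pathway is consumed by exactly one enzyme.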


2.4.3 Section 2b: The Modern Pathway,
Precorrin-2 to Cobyrinic Acid a,c-Diamide

In the modern pathway, the methylations 
occur in the same order as previously 
described for the ancestral pathway. The 
transformation of precorrin-2 is initiated 
by the addition of a methyl group at 
position C20 by the SAM methyltrans-
ferase CobI enzyme [39]. Precorrin-3A is
then hydroxylated by a monooxygenase, 
CobG, that contains an Fe-S center and 
a non-heme iron, and utilizes molecular 
oxygen as a substrate. This results in the 
formation of a gamma lactone on ring A, 
leading to the generation of precorrin-3B. 
Interestingly, in R. capsulatus the transfor-
mation of precorrin-3A into precorrin-3B
is mediated by a flavohemoprotein called
CobZ [40]. The next step in the path- 
way involves attachment of a methyl 
group at C17 by CobJ, a reaction that 
promotes the extrusion of the methy- 
lated C20 position (but which remains 
attached to C1). Precorrin-4 therefore 
represents the first ring-contracted inter- 
mediate [8, 41]. The transformation of 
precorrin-4 into precorrin-5 is catalyzed 
by the C11 SAM-dependent methyltrans- 
ferase, CobM. The sixth methyl transfer, 
involving the transfer of a methyl group to 
C1, is catalyzed by a typical class III SAM 
methyltransferase, CobF. To achieve this, 
the enzyme must first remove the extruded 
two-carbon unit linked to C1, which is lost 
as acetic acid. Overall, this results in the 
formation of precorrin-6A [8]. 

As with the ancestral pathway, 
precorrin-6A is reduced to precorrin-6B 
in a reaction requiring NADPH. The 
multifunctional CobL, which is equivalent 
to CbiET, and likely represents a fusion 
of two different methyltransferases, then 
transforms precorrin-6B into precorrin-8 
by methylation at C15 and decarboxylation 
of the acetic acid side chain attached to 
C12, to give precorrin-7, followed by a fur- 
ther methylation at C5. The next enzyme, 
CobH, then promotes a rearrangement 
of the methyl group from C11 to C12 
to generate the distinctly orange-colored 
HBA. Amidation of the a and c acetic 
acid side chains by CobB generates 
hydrogenobyrinic a,c-diamide (HBAD), 
and it is this amidated intermediate that 
acts as the substrate for metal chelation. 

The late cobaltochelatase is an 
ATP-dependent multimeric complex that 
consists of three subunits, CobN, CobS, 
and CobT [14]. CobN is the monomeric 
catalytic entity of the chelatase, binding
cobalt and HBAD, while CobS and CobT
form a chaperone-like complex, which 
belongs to the AAA(+) superfamily of 
proteins [42]. It is fascinating to note 
that a very similar chelatase system is 
found in the system for the insertion 
of magnesium into chlorophyll, with 
the BchH and BchI-BchD forming the 
magnesium chelatase [43]. The similarity 
of these two heteromeric chelatases 
strongly suggests that these two enzymes 
have evolved from a common ancestor. 
The product of the CobNST reaction is 
cobyrinic acid a,c-diamide. It is at this 
point that the two pathways rejoin. 


2.4.4 Section 3: Both Pathways, from 
Cobyrinic Acid a,c-Diamide to 
Adenosyl-Cobinamide 

The initial step in this stage of the 
pathway is an addition of the upper 
adenosyl group to the centrally chelated 
cobalt ion. To achieve this, the cobalt 
ion is reduced to a cobalt(I) species that 
acts as a strong nucleophile to form 
the cobalt-carbon bond. The reduction 
reaction can be catalyzed by flavodoxin 
(FldA) or by a more specific flavoprotein 
corrin reductase (CobR) [44]. It is the action
of the adenosyltransferase (CobA/CobO) 
that transfers the adenosyl group from 
ATP to the cobalt ion [45, 46]. The 
adenosyltransferase accepts a wide range 
of substrates, including cobyrinic acid 
a,c-diamide, cobyric acid, cobinamide, and 
cobalamin, allowing bacteria to scavenge 
many different intermediates [46]. 

Next, the second round of peripheral 
amidations takes place, this time at the 
e, d, b, and g carboxyl groups [47]. 
The enzyme responsible (CbiP/CobQ) 
for these amido transfers is similar to 
the previous amidases (CbiA/CobB), and 
also utilizes ATP and L-glutamine as
amido donor. Here again, these enzymes
appear to have evolved from a common
ancestor [48]. The product of this reaction 
is adenosylcobyric acid, which now has 
only a single free carboxyl side chain 
linked to C17. It is at this position that 
the amino-O-2-propanol phosphate (APP) 
linker is attached. This amide attachment 
is catalyzed by CbiB/CobD and requires 
ATP and magnesium [33, 38]. This results 
in the synthesis of adenosylcobinamide 
(phosphate), with the corrin ring now 
ready for the final assembly. The synthesis 
of APP itself will be described in the next 
section. 


2.5
Biosynthesis of the Lower Axial Ligand,
α-Ribazole Phosphate


The lower axial ligand is composed of an 
unusual base known as dimethylbenzim- 
idazole (DMB), that is linked to ribose 
phosphate. The synthesis of DMB is medi- 
ated via one of two distinct routes reflect- 
ing the living conditions of the bacterium, 
from anaerobic or aerotolerant to aero- 
bic environments. The anaerobic route 
to produce DMB, which occurs in bac- 
teria such as Eubacterium limosum, has 
not been fully elucidated but is known to 
require erythrose, formate, L-glutamine,
L-glycine, and L-methionine as substrates
[49]. An alternative route for DMB syn- 
thesis is found in aerotolerant or aerobic 
bacteria, such as Bacillus megaterium or 
Sinorhizobium meliloti, where the synthe- 
sis is achieved in a one-step reaction that 
sees the transformation of FMNH2 into
DMB (Fig. 3). This reaction is catalyzed by 
BluB, which has been nicknamed a “flavin 
destructase,” and requires molecular oxy- 
gen [50, 51]. Finally, transfer of the phos- 
phoribosyl moiety of NaMN onto DMB is 
catalyzed by the NaMN:5,6-DMB phospho-
ribosyltransferase (CobT/CobU) [52, 53].
The reaction products are nicotinate and
α-ribazole phosphate.


2.6 
Biosynthesis of 
Aminopropanol-O-2-Phosphate (APP) 


In S. enterica, the synthesis of APP 
is a two-step reaction and starts with 
L-threonine, which is phosphorylated by
the L-threonine kinase PduX [54]. The
second step is the decarboxylation of
the L-threonine-O-3-phosphate catalyzed
by the PLP-dependent CobD enzyme [55, 
56]. 

It is interesting to note that pduX is the 
last gene of the 1,2-propanediol utiliza- 
tion (pdu) operon, and that 1,2-propanediol 
catabolism is dependent on adenosylcobal- 
amin. Moreover, the pdu and cob (cobal- 
amin) operons are juxtapositioned in the 
genome but are transcribed in opposite di- 
rections, although they share the same reg- 
ulatory elements [57]. Another intriguing 
fact is that Salmonella can use propanediol 
as the sole carbon source in the pres- 
ence of oxygen, but adenosylcobalamin 
is produced only in anaerobic conditions. 
However, Salmonella can produce cobal- 
amin in aerobic conditions from cobyric 
acid, the substrate of the APP attachment 
[57]. 


2.7 
Final Assembly 


CobU/CobP is a bifunctional enzyme ex- 
hibiting both cobinamide kinase and cobi- 
namide phosphate guanylyltransferase ac- 
tivities. This enzyme phosphorylates the 
aminopropanol moiety of cobinamide, us- 
ing ATP if the phosphate group is miss- 
ing, after which the newly added phos- 
phate displaces the pyrophosphate from 
GTP to yield adenosyl-GDP-cobinamide 
[58, 59]. Finally, the enzyme CobS/CobV
catalyzes an exchange of the GDP moi-
ety of adenosyl-GDP-cobinamide for the
α-ribazole phosphate, thus completing
the lower loop assembly and generating 
adenosyl cobalamin phosphate [53, 60]. 
It is unclear at what stage the phos- 
phate is extruded, but in S. enterica it was 
shown to be the penultimate step, using 
CobC [61]. 


3 
Nature Has Evolved Different Ways to 
Catalyze Ring Contraction 


The marked difference between cobal- 
amin and other modified tetrapyrroles is 
the absence of the C20 carbon, which 
is the result of the ring-contraction pro- 
cess. During the cobalt-late (aerobic) path- 
way, ring contraction is achieved by 
an oxygen-dependent process involving 
two enzymes (Fig. 2). The first reac- 
tion — the conversion of precorrin-3A into 
precorrin-3B — is catalyzed by a monooxy- 
genase that springloads the molecule for 
contraction by hydroxylating the C20 po- 
sition and generating a gamma lactone. 
The second reaction, the conversion of
precorrin-3B into precorrin-4, is catalyzed
by the methyltransferase CobJ and in-
volves the methylation of C17, which
is coincident with the actual contrac- 
tion of the ring. The hydroxylation of 
the C20 can be catalyzed by two very 
distinct precorrin-3B synthases, isolated 
from different organisms, termed CobG 
and CobZ. CobG is a soluble non-heme
iron-containing enzyme [10], whereas CobZ is
a cofactor-rich, membrane-associated pro-
tein that is found only in R. capsula-
tus [62].

The cobG gene was identified from sequencing
of the cobalamin biosynthetic operon in
P. denitrificans, and encodes a protein
of 459 amino acids (~46 kDa) [10]. The
function of CobG in cobalamin synthe- 
sis was demonstrated when it was shown 
to convert precorrin-3A into a compound 
with an extra 16 mass units, correspond- 
ing to the introduction of an oxygen atom 
[63]. Labeling experiments with 18O con-
firmed that one atom of oxygen was in-
corporated into the C20 position [64, 65].
Subsequent experiments proved conclu- 
sively that the latter was derived from 
molecular oxygen, and that CobG is there- 
fore a monooxygenase [11]. Oxygenases 
generally utilize an organic cofactor, and 
are either heme-dependent or non-heme 
iron-dependent in order to catalyze the 
unfavorable reaction of oxygen with an 
organic substrate [66]. Further studies 
showed, indeed, that CobG houses a 
[4Fe-4S] center as well as a non-heme 
iron. The latter displays the site for ac- 
tivation of molecular oxygen [40]. The 
enzyme mediates a two-electron oxida- 
tion, whereby the [4Fe-4S] center most 
likely feeds the reducing equivalents to 
the non-heme iron [7, 11]. CobG shares 
sequence similarity with assimilatory sul- 
fite and nitrite reductases; these enzymes 
are involved in the reduction of sulfites to 
sulfides for the incorporation into amino 
acids and cofactors. This six-electron re- 
duction requires both a [4Fe-4S] clus- 
ter and siroheme for activity. In sulfite 
reductase the two cofactors are cova- 
lently coupled through a shared cysteine 
residue [67]. 

It has been suggested that, due to its 
comparatively simple synthesis, siroheme 
(a reduced porphyrin) is an ancient co- 
factor. CobG does not require siroheme, 
but its substrate, precorrin-3A, is struc- 
turally similar. It can be considered that 
precorrin-3A binds in a similar position to 
CobG, thereby replacing siroheme. From
an evolutionary perspective, however, it
is probable that sulfite reductases and
CobG have arisen through a process of
patchwork evolution, where both have re-
tained a shared protein framework but are
specialized to catalyze different processes
[68, 69].


Genome sequencing studies have
suggested that most bacteria syn-
thesizing adenosylcobalamin via the
oxygen-dependent pathway catalyze the
monooxygenation step using CobG.
However, an exception to this is
the α-proteobacterium R. capsulatus,
which harbors the genes for the
oxygen-dependent vitamin B12 pathway
[70], although the monooxygenase gene cobG is
substituted by a gene termed cobZ [62]. 
Interestingly, CobZ shares no sequence 
similarity to CobG, and provides another 
example of an enzyme acquisition, though 
from a different ancestor to CobG. The 
enzyme reveals similarity to two proteins 
found in the tricarballylate operon of S. 
enterica [71]. 

CobZ is a protein of 86.6 kDa and is
composed of two distinct moieties. The 
N terminus harbors a flavin-binding re- 
gion, while the C-terminal region is an 
integral membrane protein containing a 
heme cofactor. At the junction between 
these two domains is a cysteine-rich re- 
gion that has the same consensus mo- 
tif for 2[4Fe-4S] clusters, as found in a 
number of complex redox proteins, such 
as heterodisulfide reductase. This hetero- 
geneous class of enzymes is found in 
methanogenic Archaea and is involved in 
biological methane formation, the reduc- 
tion of methyl-coenzyme M [72]. In other 
related α-proteobacteria, orthologs of cobZ
are observed but in all these organisms 
cobZ is found as two separate genes, en- 
coding the N- and C-terminal regions of 
CobZ. A reaction mechanism has been 
proposed suggesting that CobZ acts as
a monooxygenase, whereby the reduced
flavin reacts with oxygen to generate a per-
oxy intermediate that hydroxylates the C20
position of precorrin-3A; this then allows
the subsequent γ-lactone formation to pro-
ceed, with the generation of precorrin-3B.
The oxidized flavin is then reduced by elec- 
trons fed via the Fe-S centers and the heme 
group [62]. 

As mentioned above for CobG, CobZ 
also displays sequence similarity to pro- 
teins not involved in cobalamin biosynthe- 
sis—in this case TcuA and TcuB, which 
are found as part of an operon in S. enter- 
ica. The tricarballylate utilization operon 
(tcu) contains three genes, tcuA, -B, and
-C. The gene products of tcuA and tcuB 
align with the N and C termini of CobZ, 
respectively. TcuA is a soluble flavopro- 
tein containing 467 amino acids with a 
noncovalently bound FAD cofactor, and 
displays 47.7% similarity to the N termi- 
nus of CobZ. TcuB, which reveals 50% 
similarity to the C terminus of CobZ, is 
a membrane-associated protein contain- 
ing 379 amino acids that was formerly 
annotated as citB. At its N terminus it 
houses two 4Fe-4S clusters, while the C 
terminus consists of six transmembrane 
domains containing a noncovalently at- 
tached heme [73] (Fig. 4). It was proposed 
that TcuB acts as an electron shuttle 
reoxidizing TcuA. The dehydrogenation 
of tricarballylate yields a reduced TcuA 
(FADH2), which then passes the reducing
equivalents to TcuB via its Fe-S centers 
and heme onto the quinone pool located 
in the membranes, where they can en- 
ter the electron transport chains [73]. In 
this respect, TcuA and TcuB undertake a 
reaction that is significantly different to 
the monooxygenase function of CobZ, re- 
quiring electrons to flow in the opposite 
direction. 

Fig. 4 Comparison of the domain architecture of CobZ
(precorrin-3B synthase) with the tricarballylate utilization
proteins TcuA and TcuB.


So far, two different solutions to ring 
contraction have been identified, catalyzed 
by the enzymes CobG and CobZ. Yet, it is 
very likely that Nature has developed more 
than these two, as the purple non-sulfur 
bacterium R. sphaeroides appears to have 
evolved yet another mechanism for ring 
contraction. Although its genome does 
not contain an ortholog to either cobG 
or cobZ, there are nevertheless two un- 
characterized open reading frames in the 
cobalamin biosynthetic operon. One of 
these is predicted to encode a flavoprotein, 
and the second a membrane-associated 
protein, NnrS, which is involved in nitric 
oxide response [74]. NnrS is a putative 
heme-copper protein which can, in some 
cases, be found in bacteria encoding ni- 
trite and/or nitric oxide reductase [74]. 
In this respect, these two enzymes re- 
veal similarity to CobZ, although it must 
be proven that they are responsible for 
ring contraction in the cobalamin path- 
way of R. sphaeroides. The sequencing of 
increasing numbers of genomes has re-
vealed that there are several B12-producing
organisms that lack a “conventional”
monooxygenase such as CobG or CobZ.
The enzymatic step of ring contraction 
is highly variable in bacteria, and it has 
become clear that Nature has evolved sev- 
eral different mechanisms and enzymes 
to catalyze the ring-contraction process 
[69, 75].


4 
The Methylases in the Cobalamin 
Biosynthetic Pathway Are Largely Derived 
From One Common Ancestor 


CobG and CobZ lack sequence homologs 
in the anaerobic pathway, and appear 
to have evolved from different ancestors 
[75]. In contrast, the methylases involved 
in corrin construction not only have ho- 
mologs between the modern and ancestral 
pathways but also display an example of 
enzyme evolution as they appear to have 
evolved from a common precursor. Here, 
Nature employed one enzyme to evolve 
many different reactions. 

There are eight methylation events, 
catalyzed by six or seven methyltrans- 
ferases. The first two additions (at po- 
sitions C2 and C7) of the macrocycle 
are carried out by CobA/CysG*. The se- 
quence of the remaining methylations 
starts at C20, catalyzed by CobI or CbiL, re-
spectively, followed by C17 (CobJ/CbiH),
C11 (CobM/CbiF), and C1 (CobF/CbiD). The
methylations at C5 and C15 are com- 
pleted by either the bifunctional CobL 
or CbiE and CbiT. Despite the differ-
ent reactions that these methylases catalyze,
the enzymes of both pathways share signifi-
cant sequence similarity to each other,
except for CobF and CbiD [76]. A phy- 
logenetic tree of the methyltransferases 
involved in cobalamin biosynthesis (Fig. 5) 
clearly illustrates this connection.

Fig. 5 Schematic phylogenetic analysis illustrating the sequence similarity between the methyltransferases associated
with the early (green) and late (red) cobalt insertion pathways. The branches are labeled with the number of the carbon
atom that is methylated.

The tree
also shows a division of the branches
representing enzymes of the modern and 
ancestral pathways, indicating substrate 
specificity for non-cobalt-containing and 
cobalt-containing intermediates. Indeed, 
it has been shown that the methyltrans- 
ferases from the aerobic pathway do not 
recognize the anaerobic substrate equiv- 
alent [77]. From an evolutionary point 
of view, this suggests that the different 
methyltransferases of corrin biosynthe- 
sis have evolved from a common evolu- 
tionary methyltransferase ancestor before 
their separation into aerobic and anaerobic 
pathways. 
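
The pairing of methylation sites with enzymes described in this section can be tabulated and checked against the statement that eight methylations are carried out by six or seven methyltransferases. This is a hedged sketch rather than an authoritative annotation: "CobA/CysG" is treated as one shared label for the C2 and C7 steps, and the assignment of C15 to CbiT and C5 to CbiE follows the assignment the text calls likely. The function name `distinct_enzymes` is hypothetical.

```python
# (site, modern-pathway enzyme, ancestral-pathway enzyme), in pathway order.
# Assumptions: "CobA/CysG" is used as a single shared label for C2/C7, and
# CbiT -> C15, CbiE -> C5 follows the assignment the text describes as likely.
METHYLATIONS = [
    ("C2", "CobA/CysG", "CobA/CysG"),
    ("C7", "CobA/CysG", "CobA/CysG"),
    ("C20", "CobI", "CbiL"),
    ("C17", "CobJ", "CbiH"),
    ("C11", "CobM", "CbiF"),
    ("C1", "CobF", "CbiD"),
    ("C15", "CobL", "CbiT"),
    ("C5", "CobL", "CbiE"),
]


def distinct_enzymes(column):
    """List the enzymes in one column (1 = modern, 2 = ancestral), no repeats."""
    seen = []
    for row in METHYLATIONS:
        if row[column] not in seen:
            seen.append(row[column])
    return seen


modern = distinct_enzymes(1)
ancestral = distinct_enzymes(2)
print(len(METHYLATIONS), "methylations")
print(len(modern), "modern enzymes:", modern)
print(len(ancestral), "ancestral enzymes:", ancestral)
```

The difference in enzyme count arises solely from CobL, which is a single fused protein in the modern pathway but corresponds to the separate CbiE and CbiT proteins in organisms such as S. enterica.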

Strong evidence for this theory is also 
provided from the structural fold of the 
enzymes. Over the past decade, many 
methyltransferases involved in cobalamin 
biosynthesis have been crystallized and 
have had their structures determined [28,
78-81] (Fig. 6). Most of these enzymes,
such as CysG, CbiF, CbiE, and CbiL, be- 
long to the class III methyltransferases 
[82]; this class is one of the five different
structural folds (I-V) that have been de-
scribed for SAM-dependent methyltrans-
ferases. Members of class III share com-
mon structural features: they are homod-
imeric, with each monomer consisting of two
α/β domains linked by a single stretch of
polypeptide, giving the impression of a kid-
ney shape. The active site in this structural
family is folded into a cleft between the two
α/β domains, each containing five strands
and four helices. The methyl donor, SAM, 
is tightly bound between the two domains 
in a bent conformation, which suggests 
that the methylation is driven by a confor- 
mational distortion of SAM [83]. 

Fig. 6 The cobalamin methyltransferases evolved their temporal and regiospecificity from a common ancestral protein.
B-factor analysis of the structures of enzymes and the position methylated is given.

With increasing amounts of structural
information becoming available, a detailed
comparison of protein topologies becomes
possible. The B-factor or Wilson B-factor 
(also known as the temperature factor) ac- 
counts for the thermal motion of atoms, 
and provides an indication of the order 
or disorder of a crystal, such that areas of 
rigidity and flexibility in an enzyme can 
be identified. All of the methyltransferases 
have revealed a high degree of flexibil- 
ity around the substrate-binding pockets 
(highlighted in orange in Fig. 6), and 
have a comparatively rigid core (tinted in 
blue). 
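
As a rough illustration of how B-factors relate to atomic motion, the standard isotropic relation B = 8π²⟨u²⟩ converts a B-factor (in Å²) into a root-mean-square displacement (in Å). The sketch below applies that relation and labels positions as rigid or flexible. The threshold and the example values are hypothetical and are not taken from the structures discussed here.

```python
import math


def rms_displacement(b_factor):
    """RMS atomic displacement (in Å) from an isotropic B-factor (in Å^2).

    Uses the standard relation B = 8 * pi^2 * <u^2>.
    """
    if b_factor < 0:
        raise ValueError("B-factor must be non-negative")
    return math.sqrt(b_factor / (8 * math.pi ** 2))


def label_regions(b_factors, threshold):
    """Label each position 'flexible' if its B-factor exceeds the threshold."""
    return ["flexible" if b > threshold else "rigid" for b in b_factors]


# Hypothetical per-residue B-factors (Å^2); the cutoff of 40 is arbitrary.
example = [15.0, 22.0, 58.0, 71.0, 18.0]
for b, label in zip(example, label_regions(example, threshold=40.0)):
    print(f"B = {b:5.1f}  rms = {rms_displacement(b):.2f} Å  {label}")
```

In practice, B-factors also absorb static disorder and model error, so a cutoff of this kind is only a qualitative guide to flexibility.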

Despite their similar overall fold, the 
seven cobalamin biosynthetic methyl- 
transferases display both a high level of 
regiospecificity and temporal specificity. 
The timing and order of methylations 
occurring is important, as these deter- 
mine the position of the double bonds 
in the macrocycle and orchestrate the re- 
activity in preparation for the next step 
in the pathway [84]. The methylations 
also prevent prototropic rearrangements
and, ultimately, oxidation of the final 
corrin product. The significance of the 
methylations is likely to have grown in 
importance over time as the world’s en- 
vironmental conditions became enriched 
with oxygen. Therefore, the methyltrans- 
ferases involved in cobalamin biosyn- 
thesis represent an evolutionary adapta- 
tion to ensure the synthesis of a stable 
coenzyme. 

Key to the success of this adaptation is 
an understanding of how these methyl- 
transferases evolved their regiospecificity, 
and how they are able to discriminate 
between metal-free and metal-containing 
substrates. In order to direct methy- 
lation, the enzymes must bind the 
tetrapyrrole-derived substrate in a specific 
orientation in close proximity to SAM so as
to promote the electrophilicity of the methyl
donor. Using the available structural infor-
mation on these methyltransferases, it is
possible to identify individual key amino 
acids involved in the different reactions. 
Moreover, by comparing this family of 
enzymes it is also possible to deduce a se- 
quence for an ancestral enzyme that would 
have promiscuous methyltransferase ac- 
tivity, where the site of methylation would 
be dictated by the reactivity of the tetrapyr- 
role rather than by the regiospecific bind- 
ing of the substrate in the active site. The 
synthesis of such an enzyme could help to 
understand what modifications are neces- 
sary to change substrate recognition and 
direct catalysis, and in this way such an 
approach would help contribute to an aim 
of synthetic biology of rational enzyme de- 
sign for the construction of altogether new 
pathways. 


5 
CobA: Enhancing Pathway Productivity 
through Protein Engineering 


The significance of being able to control 
the activity of key enzymes within a path- 
way was also explored with the cobalamin 
biosynthetic pathway. Here, it was found 
that in P. denitrificans the CobA displayed 
substrate inhibition, presumably as a con- 
trol mechanism to prevent increased flux 
along the cobalamin biosynthetic pathway 
when there was a need for increased heme 
synthesis from uro’gen III. However, a 
comparison of CobAs from a range of 
different organisms indicated that, in the 
Archaea, the enzyme does not display sub- 
strate inhibition. By replacing the CobA 
with a version that does not have sub- 
strate inhibition, the research group at 
RPR was able to increase the yield of 
cobalamin production in the host strain 
[85]. 
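
The consequence of substrate inhibition for pathway flux can be illustrated with a common textbook rate law, v = Vmax[S] / (Km + [S] + [S]²/Ki), compared with plain Michaelis-Menten kinetics. This is a sketch under stated assumptions: the source does not give the kinetic form or any constants for CobA, so the rate law is an assumed model and all parameter values below are hypothetical.

```python
import math


def michaelis_menten(s, vmax, km):
    """Rate for an enzyme without substrate inhibition."""
    return vmax * s / (km + s)


def substrate_inhibited(s, vmax, km, ki):
    """Rate for an enzyme with substrate inhibition (assumed textbook form)."""
    return vmax * s / (km + s + s * s / ki)


def optimum_substrate(km, ki):
    """Substrate concentration giving the maximal inhibited rate: sqrt(Km*Ki)."""
    return math.sqrt(km * ki)


# Hypothetical constants in arbitrary units; not measured values for CobA.
VMAX, KM, KI = 1.0, 1.0, 100.0
for s in (1.0, 10.0, 100.0, 1000.0):
    print(f"[S] = {s:7.1f}  "
          f"uninhibited = {michaelis_menten(s, VMAX, KM):.3f}  "
          f"inhibited = {substrate_inhibited(s, VMAX, KM, KI):.3f}")
print("inhibited rate peaks at [S] =", optimum_substrate(KM, KI))
```

Under this model the inhibited enzyme slows down as substrate accumulates, whereas the uninhibited variant keeps approaching Vmax, which is consistent with the higher yield reported when a CobA lacking substrate inhibition was used.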


6
Channeling Efficiency 


An interesting question concerning cobal- 
amin biosynthesis is how can a pathway 
with so many unstable intermediates op- 
erate efficiently within the cell, especially 
if only small quantities of the final prod- 
uct are required? It appears that many 
of the cobalamin biosynthetic enzymes 
bind the products of their reactions quite 
tightly, and in this respect enzymes such 
as CobJ, CobK, and CobH can be isolated 
as enzyme-product complexes. Moreover,
CobE, a protein of unknown function, was 
also shown to be capable of binding a range 
of cobalamin biosynthetic intermediates, 
in particular precorrin-7 and precorrin-8. 
It seems likely that the pathway has evolved 
so that the enzymes retain their products 
and then pass them directly onto the next 
enzyme in the pathway, utilizing a pro- 
cess of substrate channeling to increase 
the efficiency of the biosynthesis and to re- 
duce loss due to oxidation and breakdown 
of the pathway intermediates. The prop- 
erty of tight product-binding enzymes can 
be exploited to help in the elucidation of 
the pathway through the overproduction 
of such enzymes, allowing the isolation of 
stabilized products [35]. 


7 
Biosynthesis of Mismethylated Analogs of 
Cobalamin 


In the aerobic pathway, the conversion of 
precorrin-6B into precorrin-8 is catalyzed 
by the multifunctional enzyme, CobL [32]. 
This transformation requires methylation 
of the tetrapyrrole framework at both the 
C5 and C15 positions, together with the 
decarboxylation of the acetate side chain 
attached to C12 (ring C) (Fig. 7). An
analysis of the CobL sequence identified
two distinct domains, suggesting that the 
protein is the result of a gene fusion event 
between a class I methyltransferase and 
a canonical class III type enzyme [34]. In
the anaerobic pathway, the orthologous en- 
zyme is frequently found as two individual 
proteins, CbiE and CbiT, which demon- 
strate similarity to the C- and N-terminal 
domains of CobL, respectively. Through 
dissection of the CobL protein, the func- 
tion of each of the two domains has 
been ascribed [35]. The C-terminal domain 
(CobLC) was shown to catalyze the con-
certed SAM-dependent methylation at C15
and decarboxylation of the ring C acetate,
and this precedes the SAM-dependent
methylation of the C5 position by the
N-terminal domain (CobLN). By using the
truncated CobLC protein, it was possible to
convert precorrin-6B into a new intermedi- 
ate precorrin-7. Addition of the N-terminal 
domain of CobL resulted in the forma- 
tion of precorrin-8, which could be further 
converted into HBA with the addition 
of CobH (which catalyzes the suprafa- 
cial 1,5-sigmatropic rearrangement of the 
methyl groups attached to the C11 and 
C12 positions). Surprisingly, the in vitro 
incubation of precorrin-7 with CobH re- 
sults in the formation of a new com- 
pound, C5-desmethyl-HBA, which lacks 
the methyl group at the C5 position [35] 
(Fig. 7). When the biosynthetic pathway 
was artificially reconstructed in E. coli 
(which lacks the genetic ability to pro- 
duce cobalamin de novo) in a stepwise 
fashion, the effect of the removal of this 
domain could be tested in vivo. Strains 
which were transformed with a plasmid 
(cobAIGJMFKLH) containing all of the 
genetic information to make HBA from 
uro’gen III appeared a pinkish orange 
color, and accumulation of HBA could be de-
tected in the cytoplasm [35, 62].

Fig. 7 The synthesis of C5-desmethyl-hydrogenobyrinic acid from precorrin-7.

When the methyltransferase domain responsible for
the C5 methylation (CobLN) was removed, 
the resulting strain now accumulated 
C5-desmethyl-HBA [35]. Thus, the dele- 
tion of a single methyltransferase domain 
from the cobalamin biosynthetic path- 
way resulted in the production of a novel 
derivative of the corrin ring which does not 
normally occur in Nature. This highlights 
the ease with which a range of non- 
physiological corrin derivatives could be 
biosynthetically created in a combinatorial 
manner through the removal of discrete 
enzymatic activities from the pathway. 
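
The logic of this domain-deletion experiment can be sketched as a small rule-based model in which each activity converts one intermediate into the next, and the product that accumulates depends on which activities are present. The reactions are those described above. The rule ordering (the N-terminal CobL domain acts on precorrin-7 before CobH does when both are present) is a simplifying assumption standing in for the proposed metabolite channeling, and the function name `end_product` is hypothetical.

```python
# (activity, substrate, product). Rule order encodes the assumption that the
# CobL N-terminal domain captures precorrin-7 before CobH when both are present.
REACTIONS = [
    ("CobL C-domain", "precorrin-6B", "precorrin-7"),   # C15 methylation, decarboxylation
    ("CobL N-domain", "precorrin-7", "precorrin-8"),    # C5 methylation
    ("CobH", "precorrin-8", "HBA"),                     # methyl shift C11 to C12
    ("CobH", "precorrin-7", "C5-desmethyl-HBA"),        # out-of-turn rearrangement
]


def end_product(start, activities):
    """Follow the first applicable reaction repeatedly; return what accumulates."""
    current = start
    progressed = True
    while progressed:
        progressed = False
        for activity, substrate, product in REACTIONS:
            if activity in activities and substrate == current:
                current = product
                progressed = True
                break
    return current


full = {"CobL C-domain", "CobL N-domain", "CobH"}
print(end_product("precorrin-6B", full))                      # intact CobL plus CobH
print(end_product("precorrin-6B", full - {"CobL N-domain"}))  # C5 methylase domain removed
print(end_product("precorrin-6B", {"CobL C-domain"}))         # truncated CobL alone
```

Extending the rule table with further activities would give a simple way to enumerate which non-physiological corrin derivatives a given set of deletions is expected to produce.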

The observation that CobH will 
accept precorrin-7 as a substrate raises 
the question of why desmethylated 
products are not normally observed 
in cobalamin-producing bacteria, as a 
consequence of the enzyme reacting 
out of turn. This aberrant activity is 
perhaps precluded by two possible control 
mechanisms that are present in the 
native pathway. First, fusion of the two 
CbiE- and CbiT-like methyltransferase 
domains into the CobL protein may 
offer an enhanced metabolite transfer 
between the two proteins, ensuring that 
precorrin-7 does not accumulate. Second, 
the chaperone protein CobE, which has
the ability to bind the intermediates
between precorrin-6B and HBA and has
been suggested to shuttle labile
intermediates between the various
Cob enzymes, could prevent CobH from
catalyzing the rearrangement out
of turn [35].

In a similar experiment, the genes re- 
quired for the anaerobic biosynthesis of 
cobyrinic acid a,c-diamide were overex- 
pressed in an E. coli host strain. In this 
case, when the deletion of the cbiD gene 
was investigated the results indicated that 
CbiD was involved in the methylation of 
the C1 position and, interestingly, that
strains carrying the deletion produced
C1-desmethyl-cobyrinic acid a,c-diamide
[31]. 

These two examples have emphasized 
the use of both pathways to generate cor- 
rin derivatives lacking methyl groups at 
either the C1 or C5 positions around the 
macrocycle. It is also possible to generate 
overmethylated products which contain 
additional methyl groups. This activity has 
been well documented for those enzymes 
which have uro’gen III methyltransferase
activity, and this is probably in part due to 
the fact that they are the most extensively 
studied of the cobalamin methyltrans- 
ferases. The first committed step in the 
biosynthesis of cobalamin is catalyzed by 
CobA, and involves the SAM-dependent 
methylation of the C2 and C7 positions 
of uro’gen III, which in turn generates 
precorrin-2. An identical reaction is also 
catalyzed by CysG, which is a bifunctional 
enzyme, responsible for the synthesis of 
siroheme from uro’gen III. Some of these 
proteins, when overexpressed, have been 
shown to catalyze an additional methyla- 
tion of the tetrapyrrole framework at the 
C12 position, which leads to the dead-end 
product trimethylpyrrocorphin [86-88]. In 
the early cobalt insertion pathway, CbiF 
has been shown to methylate an earlier 
intermediate cobalt-precorrin-3 [89]. It is 
likely that further mis-methylated prod- 
ucts will be identified in the future as 
knowledge of the remaining cobalamin 
methyltransferases is advanced.

The observed substrate promiscuity in 
cobalamin biosynthetic pathways implies 
that many of the enzymes have a fairly 
broad substrate specificity. Some of the en- 
zymes can operate out of turn, and many 
will tolerate substitutions; consequently, 
these enzymes will likely tolerate further 
chemical or biochemical modification of 
the macrocycle, making the pathway an
ideal candidate for the enzymatic gener-
ation of analogs through mutasynthesis.
The importance of metabolite channeling 
in the wild-type system is also indicated. 
Tight binding of the precorrin products to 
their respective enzymes has been utilized 
advantageously, where the purification of 
intermediates as enzyme-product com-
plexes has greatly enhanced the present
understanding of the pathway and has 
granted access to hitherto unavailable 
compounds [35]. However, this behavior 
is most likely linked to the kinetic labil- 
ity of the intermediates, and suggests that 
to maintain the expected flux through the 
pathway would require the efficient trans- 
fer of intermediates between enzymes 
through their transient interaction, pos- 
sibly in the form of a loosely associated 
metabolon [35, 90]. 

It is useful to emphasize that the over- 
production of enzymes and reconstruction 
of full and partial biosynthetic pathways 
can sometimes lead to an accumulation of 
unexpected material (using the cobalamin 
pathway as an example; mis-methylated 
products can sometimes be detected). This 
non-native behavior is probably exacer- 
bated by the increased concentration of 
enzymes and substrates beyond their typi- 
cal physiological concentrations. 


8 
Substitution of the Metal Ion


The biological function of cobalamin 
is critically dependent on the centrally 
chelated cobalt ion, and thus exchange of 
this transition metal is an obvious choice 
for manipulation. This first requires the ac- 
quisition of metal-free compounds, since 
once cobalt has been inserted its removal 
is not possible without concomitant de- 
struction of the macrocycle [91]. Currently, 
no enzymes with dechelatase activity have
been identified. However, a number of 
metal-free corrinoids ranging from HBA 
to hydrogenobalamin have been isolated 
from certain species of photosynthetic 
bacteria, typically when grown under con- 
ditions of limiting cobalt [92-97]. Based 
on the current understanding of the late 
cobalt insertion pathway (which these 
organisms operate), it is unclear as to 
how these descobaltocorrinoids are produced.
Cobalt is inserted, after the synthesis
of the corrin ring is completed, by
the CobNST cobaltochelatase into HBAD 
[14]. Therefore, the isolation of HBA 
and its mono- and di-amidated products
could be predicted since these are natu- 
rally occurring. These intermediates can 
also now be produced in large quantities 
from recombinant E. coli strains harbor- 
ing incomplete pathways [35, 62]. How- 
ever, the progression of these cobalt-free 
compounds through the pathway seemed 
doubtful, as the enzyme responsible for 
completing amidation of the side chains, 
CobQ, requires both the presence of the 
cobalt and an adenosyl group attached 
as the upper axial ligand for activity [15, 
98]. 

The availability of descobaltocorrinoids 
has opened up the possibility of inserting 
alternative transition metal ions into the 
corrin framework (Fig. 8). The insertion 
of zinc [95, 96], copper [95, 96, 99], man- 
ganese [100], and rhodium [97, 101-104] 
into these metal-free compounds has been 
reported. Chelation of these alternative 
metals has been performed chemically, 
usually in either ethanol or alkaline aque- 
ous solution at elevated temperature, as 
the cobaltochelatase complex is only able 
to insert cobalt [14]. This is in contrast to 
bacteriochlorophyll synthesis, where it is 
possible to obtain zinc bacteriochlorophyll 
using a bchD mutant [105-107].
Fig. 8 Periodic overview of the metals that have been chemically inserted into the corrin macrocycle.


The coordination environment and 
properties of these metal analogs differ 
greatly from those of the cobalt-containing 
molecules (e.g., the copper derivatives are 
no longer able to coordinate a lower axial 
ligand [99]). Such metal analogs of cobal- 
amin and its precursors are of interest in 
the study of B12-dependent enzymes, the
cobalamin biosynthetic enzymes, and also 
as possible antimicrobial agents. 

The Schilling test is an example of an 
early medical application of this form
of modification. In this instance, radioactive
isotopes of cobalt (typically 57Co or
58Co) were incorporated into the macrocycle
through bacterial fermentation. The
resulting compounds were first used in 
the diagnosis of pernicious anemia in 1952 
[108]. 


9 
Tailoring of the Nucleotide Loop 


Cobalamin has both upper (Coβ) and
lower (Coα) axial ligands coordinating to
the cobalt ion at the heart of the cor- 
rin ring. The lower axial ligand is DMB, 
and forms part of the nucleotide loop 
which is affixed to the macrocycle, via an 
aminopropanol linker, through derivatiza- 
tion of the C18 propionic acid (ring D). 
Several naturally occurring B12 analogs,
with alternative nucleotide loops, have 
been isolated from a range of different 
microorganisms [109-111]. These analogs 
typically differ in the chemical nature of
the base, where DMB is replaced by
benzimidazole derivatives, substituted
purines, or phenolic compounds
[110, 111]. Many microorganisms have 
the capability to synthesize a range of 
these derivatives. For example, S. enterica 
produces three corrinoids with different 
lower ligand bases: cobalamin (DMB),
pseudocobalamin (adenine), and factor A
(2-methyladenine) [112, 113]. Corrinoids 
with alternative linkers have also been 
isolated; for example, S. multivorans was
found to use ethanolamine in place of 
aminopropanol [109]. The reason for such 
variation in the lower ligand is unclear, and 
remains a longstanding question in the 
field. The ability of an organism to tolerate 
lower loop variation also differs greatly. E. 
coli (which is unable to synthesize cobal- 
amin de novo) appears to accept a wide 
range of lower base analogs [114]. However,
numerous B12-dependent enzymes
show a strong preference for one partic- 
ular form of the cofactor; these include 
human methionine synthase which is 
over 1000-fold more active with cobalamin 
than pseudocobalamin [115] and gluta- 
mate mutase from Clostridium tetanomor- 
phum, which is 70-fold more active with 
benzimidazyl-cobamide than cobalamin 
[116]. The reason for this variation is 
not always apparent and probably differs 
from enzyme to enzyme, but likely origi- 
nates from differences in both the catalytic 
properties of the cofactor and the binding 
affinity to the enzyme. For instance, the 
phenolic derivatives (phenol and p-cresol) 
cannot coordinate the cobalt ion, which 
means that these compounds only exist in 
the base-off conformation and thus are not 
functional with an enzyme requiring the 
base-on form [117]. 

As seen with S. enterica, microorganisms 
are often able to produce cobamides with 
a range of different bases. The choice 
of the lower ligand can in some cases 
be attributed to environmental conditions 
and metabolite availability. When grown 
under microaerobic conditions, S. enterica 
will produce cobalamins, whereas under 
anaerobic conditions the products are 
pseudocobalamin and factor A [113]. The 
choice of end product can also be guided by 
the addition of the respective bases to the
media, and this method has been used to 
produce a range of natural and non-natural 
derivatives [118-121]. 

Most microorganisms — even those with 
the ability to synthesize cobamides de novo 
— are able to salvage incomplete corrinoids 
from the environment and complete the 
synthesis of the lower nucleotide loop 
[122]. Some Archaea and Bacteria, which 
have a lower ligand specificity, also possess 
the ability to exchange the lower loop 
[111, 123]. One such mechanism has 
been elucidated in R. sphaeroides, where 
the cobinamide amidohydrolase (CbiZ)
enzyme is able to cleave the lower loop, 
forming adenosylcobinamide to which 
the appropriate nucleotide loop can be 
attached [124, 125]. 


10 
Cobalamin Conjugates 


The use of vitamin conjugates is emerging 
as a promising strategy for the target- 
ing and delivery of therapeutic molecules. 
Such approaches hijack the body’s nat- 
ural transport mechanisms to deliver 
the drug molecule. Cobalamin is known to
be in high demand in many disease states
and to be essential for tumor growth; accordingly,
cobalamin-conjugated drugs have signif- 
icant potential in the treatment of certain 
types of cancer [126]. 

Cobalamin is produced by bacterial fer- 
mentation, since the total synthesis, which 
required more than 70 steps, is not a 
commercially viable option [4]. This has 
adversely affected the development of 
cobalamin-based drugs by limiting the 
available conjugation strategies. However, 
there are several functional groups which 
can be modified for conjugation with other
compounds; these include alkylation of
the cobalt ion, modification of the amide 
side chains, conjugation to the ribose 
5’-hydroxyl, and through the incorpora- 
tion of benzimidazole or aminopropanol 
derivatives (Fig. 9). 

The easiest strategy for the derivatiza- 
tion of cobalamin is alkylation of the 
cobalt metal center. Modification at this 
position does not seem to affect binding 
to transcobalamin (II) or intrinsic fac- 
tor (intrinsic factor and transcobalamin 
are cobalamin-binding and transport proteins
involved in the uptake of B12 from
the small intestine and its circulation
in plasma, respectively) [127, 128]; however,
the instability of the cobalt-carbon
bond does limit the use of these compounds
as pharmaceuticals. The chemical
environments of the amide side chains
are all very similar, which makes it
difficult to achieve specificity; a range of
products can form, which complicates
purification and results in low yields
[129]. Many of these side-chain adducts
also show a diminished binding to the 
cobalamin transport proteins; only substi- 
tution at the e-side chain appears tolerable 
[129-131]. Modification of the lower loop
at the 5’-hydroxyl appears not to affect
binding to transcobalamin, although this 
has largely been achieved through an es- 
ter linkage which might be cleaved in vivo 
[129]. 

An explanation of why the two optimal 
positions for modification are the e-side 
chain and the ribose hydroxyl has come 
from structural studies, where analysis 
of the crystal structure of the complex 
between cobalamin and transcobalamin 
demonstrated that these are the most 
solvent-accessible positions [132]. 

The use of cobalamin conjugates has
huge potential, although careful consideration
towards the position of attachment,
spacer length and the coupling chemistry
used is required to construct a compound
that can be translated into an effective
therapy. In the future, synthetic biology
approaches may allow a derivatization of
the corrin ring through modification of the
biosynthetic pathway, and thus the synthesis
of a new generation of therapeutic
cobalamin conjugates.

Fig. 9 Representation of the substituents
of cobalamin which are amenable to modification
and conjugation. Amide side chains
are highlighted in green, substitution of the
aminopropanol linker in orange, substitution
of the lower base (DMB) in purple, ribose OH
group in maroon, exchange of the cobalt ion
in pink, and the upper axial ligand in black.


11
Physical Compartments for Pathway
Sequestration


Within the discussions on cobalamin
biosynthesis, it has been highlighted how
the synthesis of this nutrient may be
assisted by substrate channeling. Other
forms of directed metabolism are found
in Nature, including the use of multienzyme
complexes and compartmentalization.
Here, a link between cobalamin and
bacterial organelles is described.

Cells are biologically complex systems 
that require a high degree of organiza- 
tion to solve the challenges of toxic path- 
way intermediates, competing metabolic 
reactions, and slow turnover rates. In 
order to provide a three-dimensional
organization and the optimization of
cellular processes, eukaryotic and prokaryotic
cells have evolved the scheme of
compartmentalization, which facilitates
the isolation of metabolic pathways and 
subsequent linking through controlled 
proximity. Examples of compartmentaliza- 
tion include multienzyme complexes such 
as metabolons associated with metabolic 
pathways (tricarboxylic acid cycle [133] and 
glycolysis [134, 135]), membrane-bound 
organelles, bacterial microcompartments 
(BMCs), and magnetosomes [136].

More recently, BMCs have become at- 
tractive engineering objects for the effi- 
cient assembly of metabolic enzymes and 
the production of high titers of valuable 
compounds. Aspects of the biogenesis, 
engineering and biotechnological appli- 
cations of BMCs are discussed in the 
following sections. 


11.1 
Bacterial Protein-Based 
Microcompartments and Their Functions 


Bacterial microcompartments are proba- 
bly the largest protein-based macromolec- 
ular assemblies found in bacterial cells. 
They consist of an outer shell that encases 
a particular metabolic process that is con- 
nected to the remainder of the cellular 
metabolism by selective pores in the pro- 
teins that form the shell. Many bacteria 
produce these organelles, which encapsu- 
late vital pathways in order to facilitate 
the channeling of substrates [137], co- 
factor recycling [138], the mitigation of 
toxicity [139, 140], and the control of evap- 
orative loss [141]. Based on experimen- 
tal evidence and comparative genomics, 
seven main classes of BMC have been 
reported according to the core enzymes 
and pathways confined within their lumen
[142]. These classes of BMCs are
associated with CO2 fixation, metabolism
of ethanolamine, ethanol, 1,2-propanediol,
amino alcohol, and fuculose 1-phosphate
[142].

The carboxysomes were the first of the 
BMCs to be discovered. A carboxysome is a 
highly organized quasi-icosahedral structure
of about 120 nm diameter, found in
multiple copies in cyanobacteria and some
chemoautotrophic bacteria [143-148]. The
carboxysome houses the Calvin cycle enzyme
ribulose-1,5-bisphosphate carboxylase/oxygenase
(RuBisCO) and also carbonic
anhydrase, which together fix gaseous CO2
onto ribulose-1,5-bisphosphate
to form 3-phosphoglycerate (Fig. 10). Depending
on the type of RuBisCO and carbonic
anhydrase present, two types of carboxysomes,
termed alpha and beta, have
been discovered [149]. The carboxysome
shell is proposed to act as a diffusion barrier
by retaining CO2 inside the carboxysome
and differentially blocking the competing
substrate oxygen from entering; this leads
to an elevation of CO2 concentration in
the immediate vicinity of RuBisCO, and to
an increase in the CO2 fixation rate [143,
150-154].
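
The proposed benefit of the shell can be illustrated numerically. The following is a minimal sketch that assumes simple Michaelis-Menten kinetics with oxygen acting as a competitive inhibitor of carboxylation; the function name, the kinetic constants, and the concentrations are illustrative placeholders chosen for this example, not measured values from the cited studies.

```python
def carboxylation_rate(co2_uM, o2_uM, vmax=1.0, km_co2_uM=200.0, ki_o2_uM=600.0):
    """Relative RuBisCO carboxylation rate with competitive O2 inhibition.

    All default constants are illustrative assumptions, not literature values.
    """
    # Competitive inhibition raises the apparent Km for CO2.
    apparent_km = km_co2_uM * (1.0 + o2_uM / ki_o2_uM)
    return vmax * co2_uM / (apparent_km + co2_uM)


# Hypothetical conditions: an open cytosol versus a CO2-enriched,
# O2-depleted carboxysome lumen.
cytosol = carboxylation_rate(co2_uM=15.0, o2_uM=250.0)
lumen = carboxylation_rate(co2_uM=1000.0, o2_uM=50.0)

print(f"cytosol-like conditions: {cytosol:.3f} of Vmax")
print(f"lumen-like conditions:   {lumen:.3f} of Vmax")
```

Under these assumed values the rate inside the shell is a much larger fraction of Vmax, which is the qualitative behavior the diffusion-barrier model predicts.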

The carboxysome was the only 
BMC to be observed for many years, 
until large-scale genomic sequencing 
allowed genetic studies that revealed 
many bacteria with the ability to create 
related carboxysome-like structures 
[141]. To date, two of the best-studied 
examples are the propanediol-utilizing 
(pdu) metabolosome [139, 155-157] 
and the ethanolamine-utilizing (eut) 
metabolosome [158, 159]. Both are about 
100-150 nm in cross-section and similar 
in size to the carboxysomal shell, but 
more irregular in shape [160, 161]. They 
are produced by Salmonella enterica and 
other enteric bacteria when metabolizing 
1,2-propanediol (a compound that is 
present during the anaerobic breakdown 
of plant wall sugars) or ethanolamine as
a carbon, energy, and nitrogen source.

Fig. 10 Cartoon representation of proposed metabolic pathways
associated with two types of bacterial microcompartment,
the carboxysome (a) and the pdu microcompartment
(b). RUBP: ribulose-1,5-bisphosphate; RuBisCo: ribulose bisphosphate
carboxylase/oxygenase; 3-PGA: 3-phosphoglycerate.

Both the pdu and the eut degradation
pathways require the presence of 
adenosyl-cobalamin as a cofactor. In fact, 
in Salmonellae cobalamin is primarily 
synthesized to support the degradation 
of propanediol [162]. The genes for 
propanediol degradation are organized in 
the pdu operon and located adjacent to 
the cobalamin (cob) operon [76, 163]. The 
two operons are transcribed divergently, 
and the region between the pdu and cob 
operons includes features for the control 
of this regulon. Both operons are induced 
by propanediol and the positive regulatory 
protein PocR [76]. 

The pdu operon is composed of up 
to 23 genes [156, 160, 164], seven of 
which encode a total of eight shell pro- 
teins (PduA, -B and -B’, -J, -K, -N, -U, 
and -T). The other genes encode metabolic 
enzymes (PduCDE, PduL, PduP, PduQ, 
and PduW) [165, 166], cobalamin and 
dehydratase reactivation (PduGH, PduO,
PduS) [167-170], and cobalamin biosyn- 
thesis (PduX) [54]. One gene encodes a 
propanediol diffusion facilitator (PduF) 
[171]. Another gene, pduM, has recently 
been identified to encode a structural pro- 
tein [172], and the product of the gene 
pduV is thought to be involved in the 
interaction between the BMC and the 
cytoskeleton [157]. The requirement for 
such a large operon for the relatively sim- 
ple degradation of 1,2-propanediol can be 
explained by the need to contain the inter- 
mediate propionaldehyde within a protein 
compartment to reduce toxicity to the re- 
mainder of the cell, to mitigate DNA 
damage, and to prevent the loss of pro- 
pionaldehyde as a gas [140, 173]. The pdu 
microcompartment also facilitates the re- 
cycling of multiple cofactors, enzymatic 
reactivation, and enables the channeling of 
several intermediary metabolites (Fig. 10). 


As 1,2-propanediol enters the lumen of 
the pdu organelle it is converted to propionaldehyde
by the B12-dependent enzyme
complex diol dehydratase (PduC, PduD, 
and PduE). Propionaldehyde is converted 
to propionyl-CoA and 1-propanol. This
is a disproportionation reaction that
allows the recycling of NAD+ inside the
microcompartment [138]. Propionyl-CoA
leaves the microcompartment to be further
metabolized to propionate. Both propionate
and 1-propanol feed into the central
metabolism via the methyl-citrate pathway 
[174] (Fig. 10). 
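
The cofactor recycling described above can be checked with a small stoichiometric balance. This is a sketch under stated assumptions: the text lists PduP and PduQ only as metabolic enzymes, and their assignment here as the NAD+-reducing, CoA-acylating aldehyde dehydrogenase and the NADH-oxidizing alcohol dehydrogenase, respectively, is taken from the wider literature rather than from this chapter. Protons are omitted for simplicity, and the names `REACTIONS` and `net_balance` are illustrative.

```python
from collections import Counter

# Species -> stoichiometric coefficient (negative = consumed, positive = produced).
# Enzyme assignments for PduP and PduQ are assumptions (see lead-in); H+ is omitted.
REACTIONS = {
    "PduCDE": {"1,2-propanediol": -1, "propionaldehyde": +1, "H2O": +1},
    "PduP": {"propionaldehyde": -1, "CoA": -1, "NAD+": -1,
             "propionyl-CoA": +1, "NADH": +1},
    "PduQ": {"propionaldehyde": -1, "NADH": -1,
             "1-propanol": +1, "NAD+": +1},
}


def net_balance(fluxes):
    """Sum the stoichiometry of each reaction weighted by its flux."""
    total = Counter()
    for name, flux in fluxes.items():
        for species, coeff in REACTIONS[name].items():
            total[species] += coeff * flux
    # Report only species with a non-zero net change.
    return {species: n for species, n in total.items() if n != 0}


# Two propanediol molecules enter; one aldehyde is oxidized, one is reduced.
balanced = net_balance({"PduCDE": 2, "PduP": 1, "PduQ": 1})
print(balanced)
```

With equal flux through the oxidizing and reducing branches, neither NAD+ nor NADH appears in the net balance, which is the sense in which the disproportionation recycles the cofactor inside the compartment.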

Interestingly, pdu microcompartments 
from the Gram-positive Lactobacillus 
reuteri strain 20016 (originally isolated 
from human feces) have also been 
reported to metabolize glycerol to 
produce 1,3-propanediol, using the 
cobalamin-dependent diol dehydratase 
(because the organism does not possess 
a distinct glycerol dehydratase) [164]. 
During that metabolism, L. reuteri can 
also produce reuterin, an antimicrobial 
agent. Both reuterin and 1,3-propanediol
are of industrial importance, the latter 
as a starting material for the production 
of plastics. The production process for 
1,3-propanediol from glycerol relies on the 
activity of diol dehydratase, which converts 
glycerol to 3-hydroxypropionaldehyde, 
which is then reduced to 1,3-propanediol 
by the action of a 1,3-propanediol:NAD 
oxidoreductase [164, 175]. The limiting 
factor for the biotechnological production 
of 1,3-propanediol is the dehydratase 
activity [176]. Bioengineering of the 
pdu pathway for improved reuterin and 
1,3-propanediol production may be of
interest for the medical and renewable 
chemical industries. 

Other proteinaceous compartments in 
bacteria include lumazine synthase complexes
[177] and encapsulins [178]. These
compartments are less than 30 nm in
diameter and are composed of 60-180
subunits of a single protein, where the 
number of subunits is dependent on the 
cellular environment. They contain mostly 
single enzymes in a low copy number 
of about 10 per shell [179]. Organisms 
forming encapsulins have two-gene oper- 
ons consisting of either a gene encoding 
for an iron-dependent peroxidase (DyP) 
or for a protein that is closely related to 
a ferritin-like protein (Flp) and a gene 
encoding for an encapsulin protein. The 
enzymes DyP or Flp reside in the lumen 
of a simple nanocompartment that is as- 
sembled from 12 homopentamers of the 
encapsulin protein [178]. Containment of 
these enzymes in the case of the Flp pro- 
tein may help to protect the organism from 
oxidative damage through Flp iron oxida- 
tion products [180] and, in the case of the 
DyP protein, contribute to the survival of 
pathogenic bacteria such as Mycobacterium 
tuberculosis when they encounter reactive 
oxygen species as part of the host defense 
system [178]. 
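
The subunit numbers quoted above (60-180 subunits; 12 pentamers in the simplest shell) are consistent with closed icosahedral shells described by Caspar-Klug triangulation numbers. That interpretation is an assumption added here for illustration and is not stated in the text; the function names below are likewise illustrative.

```python
def triangulation_number(h, k):
    """Caspar-Klug triangulation number T = h^2 + h*k + k^2."""
    return h * h + h * k + k * k


def shell_composition(h, k):
    """Subunit and capsomer counts for an ideal icosahedral shell."""
    t = triangulation_number(h, k)
    return {
        "T": t,
        "subunits": 60 * t,        # 60 subunits per unit of T
        "pentamers": 12,           # always 12, one per vertex
        "hexamers": 10 * (t - 1),  # zero for the simplest (T = 1) shell
    }


for h, k in [(1, 0), (1, 1)]:
    print(shell_composition(h, k))
```

In this idealized picture, the 12-pentamer nanocompartment corresponds to T = 1 (60 subunits), and the upper end of the quoted range corresponds to T = 3 (180 subunits).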


11.2 
Architecture of Bacterial 
Microcompartments 


The shells of carboxysomes, pdu
and eut microcompartments consist
of phylogenetically related proteins
(Fig. 11). In the carboxysome, these
shell proteins are termed either Cso
(for carboxysome, α-carboxysome), or
Ccm (for CO2-concentrating mechanism,
β-carboxysome), whereas the shell
proteins associated with 1,2-propanediol
and ethanolamine degradation pathways
are termed Pdu and Eut. Individual shell
proteins have a specific BMC motif and
self-assemble into cyclic hexamers that are
perforated by narrow pores through the
sixfold axis of symmetry. The disc-shaped
hexamers further assemble side-by-side 
into sheets that are then closed to form 
a cage (Fig. 11) [137, 153]. The packing 
arrangements of individual shell proteins 
are still unknown, and it also remains 
unclear which sides of the hexamers face 
the lumen of the compartments. 
Structural studies have shown that 
the hexamers forming the sheets be- 
long to the BMC domain class (BMC; 
Pfam 00936). The vertices of the mi- 
crocompartments are thought to be oc- 
cupied by pentamers belonging to the 
CcmL/CsoS4/PduN (Pfam 03319) domain 
class [149]. The ethanolamine protein 
EutN belongs to the latter class, but 
forms hexamers when crystallized by itself 
[148]. Typically, multiple paralogs of mi- 
crocompartment proteins can be found in 
one type of microcompartment; examples 
of these include CsoS1A/B/C/D and 
CcmK1/2/3/4 in α- and β-carboxysomes,
respectively, PduA/B/J/K/T/U in pdu and 
EutL/M/K/S in eut microcompartment 
shells. The similarity between the various 
shell proteins of different types of mi- 
crocompartment is illustrated in Fig. 11. 
Multiple distinct BMC folds are adopted 
to facilitate functional differentiation be- 
tween the protein paralogs [181]. Genetic 
events have also led to fusions between 
certain BMC domains, resulting in two 
tandem copies [182, 183]. Three copies of 
such tandem proteins assemble to make 
a symmetric trimer with pseudo-sixfold 
symmetry [184, 185]. Other deviations of 
symmetry or structure include bent hex- 
amers formed by an Eut protein (EutS) 
[186] and double-disk structures observed 
in a carboxysome protein (CsoS1D) [182]. 
BMC proteins that represent the major
components of their respective microcompartment
shells (e.g., PduA in the
pdu microcompartment) have small pores
at the center of the shell proteins that
differ in size, electrostatic properties, and
hydrogen bonding, and that are presumed to
allow transport of substrates, products,
and cofactors. It was suggested that the
transport of larger molecules without the
escape of sequestered intermediates might
be facilitated by gated pores that are
usually found in tandem domain shell
proteins.

Fig. 11 Illustration of the similarity between the shell proteins
of different microcompartments and their higher organization.
Panels: BMC domain class; CcmL/CsoS4/EutN/PduN domain class.
(Adapted from Ref. [157].)

Evidence for gated pores has been 
derived from the crystal structures of the 
pdu microcompartment protein PduB and
the α-carboxysomal shell protein CsoS1D.
The tandem domain protein PduB from L. 
reuteri [184] revealed three subunit pores 
and a central pore. Each pore had glycerol 
molecules bound (L. reuteri has been 
shown to metabolize both glycerol and 
1,2-propanediol within pdu organelles). 
The subunit pores were suggested to act 
as a channel for substrate, whereas the 
central pore might be a ligand-gated chan- 
nel. The pore of CsoS1D has been shown 
to adopt two conformations, open or shut, 
and is thus thought to be gated [182]. 


Another tandem BMC domain protein 
PduT contains an Fe-S center, and is therefore
likely to play a redox role [156, 187,
188]. Significantly, it has been shown re- 
cently that PduT interacts with PduS [170], 
a corrin reductase that also contains two 
4Fe-4S centers [168, 170]. The presence of 
such a redox system on the shell could 
allow either for electron transfer or for 
the transport of a redox center into the 
microcompartment. 


11.3 
Protein Sequestration into 
Microcompartments 


Because of a lack of architectural detail 
of the shell, the precise mechanism for 
the targeting and incorporation of pro- 
teins into the BMCs has not been fully 
elucidated. Enzymes inside the shell are 
likely to be encapsulated during shell 
assembly through interactions with the 
inward-facing side of shell proteins, which 
would suggest that no additional enzymes 
would be able to enter the shell after it 
has been formed. Models for carboxysome 
formation suggest that RuBisCo is simul- 
taneously aggregated and encapsulated, 
which can lead to the formation of partial 
carboxysomes, as observed using electron 
cryotomography by Iancu et al. [145]. This 
view was recently verified in a study on 
carboxysome biogenesis in Synechococcus 
PCC7942, which showed that the interior 
components of the carboxysome assemble 
first by aggregation and are then encapsu- 
lated by shell proteins [190]. 

To date, a number of N- and C-terminal 
peptide regions of proteins have been 
predicted to be required for encapsulation. 
These are thought to form a helical 
structure composed of hydrophobic 
residues, followed by less-conserved po- 
tential linker regions. These regions were 
first noticed in amino acid alignments
of PduD, E, and P with homologous 
proteins that are not associated with 
BMCs. N-terminal extensions found on 
microcompartment-associated proteins, 
unnecessary for enzymatic function, 
were absent from cytoplasmic homologs 
[175, 189, 190]. Reporter proteins such 
as green fluorescent protein (GFP) were 
fused onto proteins such as PduC and 
D, and this resulted in an internalization 
of the fluorophore into recombinant 
microcompartments [157]. Recently, the 
first 18 amino acids of both PduP and 
PduD have been confirmed as target 
sequences, and it has also been suggested 
that the C terminus of the shell protein 
PduA might interact with the N terminus 
of the enzyme PduP [189, 191]. 
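
At the sequence level, the targeting strategy described above amounts to transplanting a short N-terminal region onto a cargo protein. The sketch below is only an illustration of that idea: the 18-residue length comes from the text, but the helper names and every sequence shown are hypothetical placeholders, not the real PduP, PduD, or GFP sequences.

```python
TAG_LENGTH = 18  # length of the N-terminal targeting region reported in the text


def extract_tag(enzyme_seq, length=TAG_LENGTH):
    """Return the N-terminal region of a microcompartment-associated enzyme."""
    if len(enzyme_seq) < length:
        raise ValueError("sequence is shorter than the targeting region")
    return enzyme_seq[:length]


def fuse_tag(tag, cargo_seq, linker=""):
    """Build an N-terminal fusion: tag, optional linker, then cargo."""
    return tag + linker + cargo_seq


# Placeholder sequences (hypothetical, for illustration only).
hypothetical_enzyme = "M" + "A" * 17 + "G" * 30
hypothetical_cargo = "M" + "S" * 20

tag = extract_tag(hypothetical_enzyme)
fusion = fuse_tag(tag, hypothetical_cargo, linker="GS")
print(len(tag), len(fusion))
```

Whether a given fusion is actually encapsulated would still have to be established experimentally, for example with a fluorescent reporter as described in the text.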

Similarly, the C-terminal region of 
CcmN, a conserved carboxysome pro- 
tein, was found to be required for in- 
teraction with the carboxysome shell, as 
deletion of the peptide would prevent 
carboxysome formation, indicating that 
its interaction with the shell is an es- 
sential step in microcompartment for- 
mation [192]. Likewise, C-terminal exten- 
sions are also found on enzymes packed 
into encapsulins. These extensions have 
conserved amino acid sequences that in- 
teract with the inside of the encapsulin 
shell by binding to distinct pockets on 
the surface of encapsulin. In the hyper- 
thermophilic archaeon Pyrococcus furiosus, 
the iron transporter ferritin-related pro- 
tein (Flp) is encapsulated by being directly 
fused to encapsulin. 


11.4 
Engineering Microcompartments 


The pioneering steps toward engineering 
new activities into recombinantly pro- 
duced microcompartments have already 
been taken. First, it was shown that it
is possible to express heterologously the 
21-gene pdu operon from Citrobacter fre- 
undii in an E. coli strain that does not 
naturally have the ability to metabolize 
1,2-propanediol; the result was that E. 
coli was able to form fully functional 
pdu microcompartments [156]. An empty
pdu organelle was also constructed
through the coordinated
production of a number of Pdu shell
proteins [157]. The minimum number
required to form an empty pdu microcom- 
partment was shown to be six proteins 
(PduA-B-B’-J-K-N). However, it may be 
possible to construct an empty microcom- 
partment with even fewer shell proteins, 
as deletions of PduA and PduK from 
the wild-type pdu operon in S. enterica 
have suggested that the organelle can still 
be formed [193]. Protein sequestration 
into organelles is mediated through sig- 
naling sequences on metabolic enzymes 
found within microcompartments. Thus, 
new proteins have to be fused to signal- 
ing sequences to facilitate encapsulation 
into empty recombinant microcompart- 
ments. It has been shown that GFP, when 
fused to a number of the enzymes that 
are normally found within the Pdu BMC 
(e.g., PduC or PduD), can be internal- 
ized [157, 191]. Likewise, when GFP was 
fused to only the signaling sequences of 
the first 18 amino acids of PduD and 
PduP, it was transferred into microcom- 
partments. More recently, a carbon-fixing 
protein organelle was transplanted into 
an E. coli host that otherwise does not 
reductively fix carbon [194]. To synthe- 
size carboxysome BMCs heterologously 
in E. coli, expression of the carboxysome 
genomic locus from H. neapolitanus, con- 
taining 10 genes encoding enzymes and 
shell proteins, was sufficient. Further, recent
studies have suggested that it is
also possible to engineer recombinant
eut organelles from Salmonella in E. coli 
[195]. 


11.5 
Designing and Customizing 
Microcompartments 


At present, the pdu compartmentalization 
system appears the most promising en- 
gineering object for further development, 
as it has been shown to offer the most 
diverse range of Pdu shell proteins that 
cater for the encapsulation of a num- 
ber of different enzymes and metabolites. 
In contrast, encapsulins and lumazine 
synthase complexes are generally formed 
from only one type of shell protein and can 
only house a small number of metabolic 
enzymes, which makes them unsuitable 
systems. 

Studies on not only pdu metabolosomes 
but also lumazine synthase complexes, 
as well as encapsulins, have suggested 
that metabolic enzymes are not required 
for shell assembly, which simplifies shell 
assembly from an engineering perspec- 
tive. Existing and further studies on shell 
structure and architecture will enable the 
modulation of protein ratios and interac- 
tions, and thus protein sheet composition. 
As discussed above, the pores perforat- 
ing the microcompartment shell regulate 
metabolite entry and exit. Such regula- 
tion makes microcompartments especially 
suitable for the engineering of metabolic 
pathways, as enzymes can be sequestered 
into a region where metabolites are regu- 
lated. As the amino acid composition at the 
pore region is directly linked to specificity 
such as size and charge, amino acids in 
the pore may be engineered by random or 
directed mutagenesis to customize selectivity
(Fig. 12). Methods for assessing pathway
metabolites to monitor the success
of these mutations include fluorescence
resonance energy transfer (FRET) sensors
to measure permeability for pathway
metabolites and transcriptional biosensors
[179]. For biotechnological purposes, the
design of pores may also offer different
options for product isolation, that
is, isolation from the compartment or
excretion into the medium (Fig. 12). Specific
targeting mechanisms and sequences
have been discovered for different compartments;
however, in order to gain a
better control of the targeting process,
investigations into additional targeting sequences
and interaction mechanisms are
required.

Fig. 12 The design and customization of microcompartments.
The spanners represent modifications of targeting
mechanisms and pore properties.

Although the modulation of shell 
composition and enzyme targeting may
be challenging, compartments that
can accommodate desired synthetic 
pathways will eventually have enormous 
advantages. These include a limited inter- 
action between the cellular processes of 
the host cell and the heterologous pathway 
through physical separation of pathways, 
protecting heterologous enzymes from 
undesirable interactions, preventing 
metabolite exchange and loss, facilitating 
cofactor recycling, and increasing the 
efficiency of cellular processes (Fig. 12). 

A recently patented application is the 
accumulation of metabolic products of 
high molecular weight within BMCs. This 
was demonstrated by the targeting of 
polyphosphate kinase (Ppk) to recombinant
pdu microcompartments, which
resulted in twofold higher phosphate uptake
compared with Ppk overexpression
alone [196]. Other examples of potential
biotechnological applications include
the compartmentalization of heterologous
pathways for ethanol and biodiesel for
biofuel production in E. coli and other
industrial hosts, which could result in higher
concentrations of the product through
greater tolerance via sequestration. Other
strategies involve BMCs as cargo carriers
for molecular delivery. Such cargos could
include metal nanoparticles as well as
cytotoxic chemicals and proteins, and would
have potential applications for medicinal
uses such as tumor imaging and cancer
therapy.

Fig. 13 Microbial cell biocatalyst. A summary of how the
genetic make-up of a microbial cell can be wired for the purpose
of directing central metabolism flux towards a desired
bioproduct, such as a biofuel or a medicinal compound.


12 
Future Directions 


Within this chapter, it has been shown how 
an ability to reconstruct existing pathways 
can be used to help in their elucidation 
and to enhance production. Moreover, the 
manipulation of these pathways can also 
result in end-product variants that can be 
used as analogs for the selective inhibition 
of specific metabolic processes (Fig. 13). 
Nature has devised a number of ways in 
which to ensure the efficiency of complex 
pathways. In the case of cobalamin biosynthesis,
the products of the enzymatic reactions
are held tightly as enzyme-product
complexes that are released only when
the subsequent enzyme is near at hand.
Such substrate channeling allows for the 
efficient transfer of oxygen-sensitive inter- 
mediates and also ensures a high yield of 
the final product. Similar approaches can, 
of course, be applied to vitamins other 
than cobalamin. In particular, the use of 
vitamin analogs as molecular handles to 
transport cargoes into the cell is particu- 
larly attractive, while the use of vitamins as 
labels to image cells and tissues also rep- 
resents a compelling endeavor. Another 
method of increasing efficiency within 
a metabolic pathway is to use compart- 
ments, and several metabolic pathways 
involving cobalamin-dependent enzymes 
have been found to be encased with bacte- 
rial microcompartments — proteinaceous 
organelles that separate a pathway from 
the remainder of the cytoplasm. The abil- 
ity to separate and organize metabolism 
in this way has been an aspiration of 
metabolic engineers, since it allows for 
the localized enhanced concentration of 
enzymes to increase mass action through 
a pathway. Compartments also protect
the cell against toxic intermediates, and
make it possible to generate bespoke
reaction vesicles within a cell; building
such vesicles around ergonomic design
principles would be a significant achievement. In
this respect, the redesigning of bacterial 
microcompartments for the synthesis of 
fine chemicals and fuels is an achievable 
objective.


Keywords


Minimal genome 

As defined by Arcady Mushegian, a genome containing the smallest number of genetic
elements sufficient to build a modern-type free-living cellular organism. Most attempts to 
define a minimal genome have focused on the minimal protein-coding gene-set, ignoring 
functional RNAs, regulatory and other noncoding sequences, and the organization of 
these elements on chromosomes. In this article, both protein-coding and RNA genes 
have been included. 


Orthologous genes 

Genes present in different species that originated from a single gene from a common 
ancestor by speciation. They are identified through phylogenetic analyses. Usually, 
orthologs retain the same function throughout evolution. For this reason, identification 
of orthologs is used to predict gene function in newly sequenced genomes. 


Non-orthologous gene displacement (NOD) 

During the course of evolution, the displacement of a particular gene by another gene 
with a similar function that does not share a common ancestor (i.e., non-orthologous 
gene). 


Translatome 

All components of the translation machinery, including ribosomes, aminoacyl-tRNA
synthetases, tRNAs, translation factors, and enzymes needed for maturation of the
different components. It is by far the most complex part of present-day bacterial cells. 


Free-diffusing cell 

Cell model defined by Luisi and coworkers, which possesses a simplified cell 
membrane and is unable to synthesize low-molecular-weight compounds. It assumes 
that these substrates are available in the environment and can permeate the cell 
membrane into the cell cytoplasm. The membrane must be nonselectively permeable 
to low-molecular-weight compounds, but does not permit the leaking out of large 
macromolecules. 


Minimal metabolism 

Metabolic network containing only the necessary and sufficient elements to achieve 
metabolic homeostasis in a minimal modern chemoorganoheterotrophic cell; that is, 
an organism using organic compounds as both carbon and energy sources, living in a 
nutrient-rich medium, in which the major metabolites (glucose, fatty acids, nitrogenous 
bases, amino acids, and vitamins) would be available without limitation, thus making 
unnecessary the de novo biosyntheses of these basic components. 


All known living beings are made of cells, each one of which stores in its genome all of
the information required for its correct functioning. The advent of high-throughput
sequencing technologies and improvements in bioinformatics tools have allowed the
complete sequencing, functional analysis and comparison of thousands of genomes
from different species, helping research groups to delineate the minimal set of
functions necessary to keep a cell alive under defined environmental conditions.
This knowledge can not only be used to obtain a better understanding of the
phenomenon of life, but also has many direct biotechnological and biomedical
implications.


1 
Introduction 


It is clearly apparent that planet Earth sup- 
ports an abundance of life, and that all 
known living forms are made of cells. To- 
day, a central principle of biology is that all 
living beings are connected in a universal 
phylogenetic tree composed of three kingdoms
[1], derived from a common ancestor
[2]. Whilst all modern cells are highly complex,
it is an accepted paradigm that such a
high level of complexity is not a necessary 
attribute of life, as the last common an- 
cestor should have been much simpler. At 
present, it is possible to deduce which 
functions are essential for the survival 
and replication of cells, which also implies 
their interaction with the environment to 
obtain the necessary substrates for a cell to 
build its own structures, and which func- 
tions are complementary to helping cells 
survive within a specific environment. The 
identification of these essential functions 
is expected to contribute greatly to under- 
standing the principles of life. 

Cells must rely on their own gene prod- 
ucts to perform all essential functions 
because, although they can usually import 
metabolites, they cannot import functional 
proteins or RNAs. Once the essential func- 
tions have been identified, it is possible to 
delineate the set of genes needed to per- 
form such functions, and this will define a 
minimal genome. The universal minimal 
genome should be the one containing the
smallest number of genetic elements sufficient
to build a modern-type, free-living
cellular organism [3]. However, all studies 
conducted to date (for a review, see Sect. 2) 
have highlighted the fact that it is impossi- 
ble to define a universal minimal genome 
for several reasons: 


• The furthest that this type of study
  could lead would be to describe the
  core of a universal minimal genome,
  because such studies are focused on
  defining the minimal complement of
  protein-coding genes, while a genome is
  much more than that. In order for a cell
  to be considered alive, a real genome
  must also include RNA genes, as well
  as essential regulatory components
  (including antisense RNA, whose
  implication in the regulation of the
  proteome has recently been highlighted
  in Mycoplasma pneumoniae [4]).

• Any minimal genome model needs
  to be tied to a particular level of
  biological organization. Due to their
  apparent simplicity and the amount
  of information that has been acquired
  during the past few decades, most
  studies have focused on the definition of
  a universal minimal bacterial genome.

• A problem in approaching a universal
  minimal genome is that, although
  the informational machinery
  (involved in DNA replication, transcription,
  and translation) is widely
  distributed in all branches of
  life, metabolism is extremely variable
  and is highly dependent on the environment.
  Therefore, a plethora of minimal
  genomes, composed of a universal informational
  core plus a variable metabolic
  gene-set, can be envisaged.

• Finally, the genome does not explicitly
  contain the information about cell
  assembly and topological architecture,
  which represents a conceptual limitation
  for using the genome as the
  minimal and sufficient description of
  the cell [5].


Even if it is possible to define only the 
core of a putative minimal genome, this 
will still be valuable for expanding the 
present knowledge of the minimal require- 
ments for life. It is for this reason that 
many scientists from different disciplines 
have interacted during the past decades, 
with the common goal of defining the 
minimal gene components of a system to 
be considered “alive,” as an initial step
towards the synthesis of artificial life [6]. 
The aim of this chapter is to present a 
refined version of the minimal gene-set 
machinery necessary for life, to summa- 
rize the different approaches undertaken 
in order to do so, and their limitations and 
expected outcomes in the not-too-distant 
future, taking as a starting point the Core 
of the Minimal Genome (CMG) as pro- 
posed by the present author’s group in 
2004 [7]. 

Before starting, it is worth noting that 
this is not the only approach used by 
synthetic biologists to design artificial minimal
cells. In fact, two opposite (but complementary)
strategies have been used
widely during the past two decades,
namely “bottom-up” and “top-down”:


• Bottom-up strategies are aimed at
  constructing from scratch the simplest
  autopoietic self-reproductive chemical
  system that could be considered alive
  [8], a “protocell”, by putting together
  the essential nonliving components,
  namely a self-replicating nucleic
  acid, a metabolic machinery, and an
  encapsulating structure [9]. Today,
  an increasing number of biophysical
  groups are continuing to follow this
  reductionist approach, attempting to
  dissect cellular processes and networks
  in order to identify minimal modules
  that could be combined and subjected
  to rigorous mathematical treatments
  [10, 11]. However, no one has yet synthesized
  a protocell, and most advances
  are devoted to the chemosynthesis of
  biopolymers (DNA, RNA, or proteins)
  inside self-reproducing vesicles [12,
  13]. In any case, there are important
  technical gaps that must be filled in
  order to synthesize a self-reproducing
  minimal cell via a bottom-up approach,
  and the results of these studies will
  produce a living system that is not
  equivalent to present-day cells.

• In contrast, top-down strategies are based
  on the study of modern cells, in an
  attempt to identify all essential genes,
  aiming to define a simplified cell that
  retains the attributes of life. The different
  top-down approaches that have been
  performed during the genomic era will
  be reviewed in detail in the following
  sections.


2 
Top-Down Approaches to the Minimal 
Gene Set 


During the past 15 years, many theoretical 
(comparative genomics, comparative proteomics,
and modeling) and experimental
top-down approaches have been used in
order to determine the minimal gene-set 
for a modern cell (Table 1). Although, 
in the following subsections, each study 
is cataloged within a given strategy, it 
should be noted that in many cases sev- 
eral approaches were combined to define 
a minimal genome. 


2.1

Preserved Genes to Approach the Minimal 
Genetic Core: The Power and Pitfalls of 
Comparative Genomics 


Comparative genomic analyses are pri- 
marily based on the alignment of DNA 
or (more frequently) protein sequences, 
in order to identify orthologous genes 
in genomes of distantly related species. 
Nowadays, it is possible to analyze hun- 
dreds of completely sequenced genomes to 
identify what is always present in modern 
cells and what can be considered acces- 
sory, assuming that genes shared between 
distant organisms are likely to be essential. 

The power of comparative genomics 
was highlighted in 1992, before any com- 
plete genome was available. In their pi- 
oneering study, Gonnet and coworkers 
[48] suggested the effectiveness of using 
computer-based methods in a systematic 
fashion to perform exhaustive analyses of 
all identified protein families, in order to 
draw conclusions about essential cellular 
functions. 

During the era of genome sequencing, 
many model organisms have been used 
to estimate the composition of a hypo- 
thetical minimal genome by comparative 
genomics. Some of the most widely used 
species are those of host-dependent sym- 
biotic bacteria, which present naturally 
reduced genomes [49]. As an adaptation 
to their lifestyle, the genomes of mutualistic
and parasitic symbionts undergo a
reductive process in which genes that are
unnecessary in their protected environ- 
ment, or whose functions can be provided 
by the host, tend to be lost. However, they 
must retain all genes involved in essential 
housekeeping functions and a minimum 
amount of metabolic transactions for cel- 
lular survival and replication in their given 
niche. 

The reduced genomes of two parasitic 
bacteria, Haemophilus influenzae (1727 
protein-coding genes) [50] and Mycoplasma 
genitalium (482 protein-coding genes) [51], 
were the first to become available. Soon af- 
terwards, their comparison provided the 
first attempt to reconstruct a minimal 
genome, composed of 256 genes [14]. 
These bacteria belong to two clades sep- 
arated from their last common ancestor 
by at least 1.5 billion years of evolution. 
Therefore, it was assumed that genes 
conserved in their reduced genomes and 
across such phylogenetic distance were 
good candidates to be considered essen- 
tial. The proposed minimal gene-set was 
obtained by identifying orthologous genes 
between the two genomes, taking into 
account non-orthologous gene displace- 
ments (NODs) (i.e., genes that encode 
proteins with a similar function but do 
not share a common ancestor and, there- 
fore, cannot be detected by comparative 
analyses) in order to fill the gaps in path- 
ways that were assumed to be essential, 
and removing any functionally redundant 
or parasite-specific gene. This hypotheti- 
cal minimal set appeared to correspond 
to a plausible minimalist bacterium. More 
than half of the genes were involved in ge- 
netic information storage and processing, 
plus a surprisingly large number of molecular
chaperones. The proposed essential
gene-set also encodes proteins needed to
carry out a simplified metabolism, a limited
number of transporters and protein
export systems, and several genes of
uncharacterized function. Although later
comparative analyses rendered different
gene sets, it is worth noting that the same
essential functions have always been identified.

Tab. 1 Top-down approaches used to define the minimal gene-set for a modern bacterial cell.

Approach                      Organism/model                                         No. of coding genes   Reference(s)

Comparative genomics          Haemophilus influenzae and Mycoplasma genitalium       256                   Mushegian and Koonin [14]
                              Five insect endosymbionts plus Mycoplasma genitalium   180                   Gil et al. [15]
                              All of the above plus Rickettsia prowazekii and        156                   Klasson and Andersson [16]
                                Chlamydia trachomatis
                              573 bacteria with completely sequenced genomes         250                   Lapierre and Gogarten [17]
                              93 Gram-negative and 93 Gram-positive bacteria         151                   Huang et al. [18]

Systematic gene               Bacillus subtilis                                      ~300                  Itaya [19]
inactivation                                                                         271                   Kobayashi et al. [20]
                              Escherichia coli                                       620                   Gerdes et al. [21]
                                                                                     303                   Baba et al. [22]
                              Acinetobacter baylyi                                   499                   De Berardinis et al. [23]
                              Streptococcus sanguinis                                218                   Xu et al. [24]
                              Staphylococcus aureus                                  133                   Song et al. [25]
                                                                                     122                   Ko et al. [26]

Massive transposon            Mycoplasma genitalium                                  ~300                  Hutchison et al. [27]
mutagenesis                                                                          382                   Glass et al. [28]
                              Haemophilus influenzae                                 478                   Akerley et al. [29]
                              Mycobacterium tuberculosis                             614                   Sassetti et al. [30]
                              Pseudomonas aeruginosa                                 300-400               Jacobs et al. [31]
                                                                                     335                   Liberati et al. [32]
                              Helicobacter pylori                                    272-344               Salama et al. [33]
                              Francisella novicida                                   395                   Gallagher et al. [34]
                              Mycoplasma pulmonis                                    310                   French et al. [35]
                              Salmonella enterica serovar typhimurium                ~490                  Knuth et al. [36]
                              Staphylococcus aureus                                  351                   Chaudhuri et al. [37]
                              Salmonella enterica serovar typhi                      356                   Langridge et al. [38]
                              Caulobacter crescentus                                 480                   Christen et al. [39]
                              Porphyromonas gingivalis                               463                   Klein et al. [40]

Antisense RNA                 Staphylococcus aureus                                  >150                  Ji et al. [41]
                                                                                     658                   Forsyth et al. [42]

Comparative                   17 environmental and pathogenic bacteria               105                   Callister et al. [43]
proteogenomics                Acholeplasma laidlawii, Mycoplasma gallisepticum,      212                   Fisunov et al. [44]
                                and M. mobile

Theoretical modeling          E-CELL                                                 127                   Tomita et al. [45]
                              Simplified Mycoplasma genitalium                       284                   Karr et al. [46]
                              MCM                                                    218                   Shuler et al. [47]

Comprehensive                 CMG                                                    207                   Gil et al. [7]
approaches                    Reviewed CMG                                           211                   This work

Another comparative approach included
the first five available small genomes of
obligate insect endosymbionts (three different
strains of Buchnera aphidicola, the primary (P-)
endosymbiont of aphids; Wigglesworthia
glossinidia, the P-endosymbiont of tsetse
flies; and Blochmannia floridanus, the
P-endosymbiont of carpenter ants) [15].
All of these are γ-proteobacteria and, due
to the massive genome reduction they un- 
derwent after the establishment of their 
respective symbioses [52], their genomes 
retain a number of protein-coding genes 
ranging from 508 to 619. The five 
genomes share only 277 protein-coding 
genes (281 if NOD is taken into ac- 
count) [15]. Some of these genes must 
be involved in endosymbiotic processes, 
while the remainder should be essential
for any type of cellular life. In order to
identify the latter subset of genes, M. 
genitalium was added to the comparison. 
The study results showed that the six 
genomes shared only 180 housekeeping 
protein-coding genes. Interestingly, and 
in concordance with the previous com- 
parative analysis of M. genitalium and 
H. influenzae [14], about one-third of the
shared genes were devoted to informa- 
tion storage and processing, and there was 
also a considerable number of molecu- 
lar chaperones, revealing the importance 
of these two categories of genes for any 
living cell. As more genome sequences 
were added to the comparisons, the num- 
ber of shared genes decreased, reducing 
the number of genes considered to be 
essential by this approach, as expected. 
Thus, the number of shared genes among 
endosymbionts and parasites was further 
reduced to 156 when the intracellular par- 
asites Rickettsia prowazekii and Chlamydia 
trachomatis were included in the compar- 
ative analysis [16]. 
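
The shrinking of the shared gene-set as genomes are added can be illustrated with a minimal sketch. It assumes that orthologs have already been assigned to shared group identifiers, so that the comparison reduces to a set intersection; the genome names and gene labels below are hypothetical toy data, not the gene lists of the cited studies.

```python
def shared_genes(genomes):
    """Return the ortholog-group IDs present in every genome."""
    gene_sets = [set(genes) for genes in genomes.values()]
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


def core_size_trajectory(genomes, order):
    """Size of the shared set after adding each genome in the given order."""
    core = None
    sizes = []
    for name in order:
        genes = set(genomes[name])
        # The first genome seeds the core; every later genome can only shrink it.
        core = genes if core is None else core & genes
        sizes.append(len(core))
    return sizes


# Hypothetical toy genomes (labels are illustrative only).
toy_genomes = {
    "A": {"dnaA", "rpoB", "groEL", "ftsZ", "pgk"},
    "B": {"dnaA", "rpoB", "groEL", "pgk", "nifH"},
    "C": {"dnaA", "rpoB", "groEL"},
}

if __name__ == "__main__":
    print(core_size_trajectory(toy_genomes, ["A", "B", "C"]))
    print(sorted(shared_genes(toy_genomes)))
```

Because an intersection can never grow, the trajectory is non-increasing whatever the order of addition, which is the trend described above. The sketch ignores non-orthologous gene displacement, so a function carried out by unrelated genes in two genomes would be dropped from the shared set.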




There is a limit in the usefulness of 
extremely reduced genomes for compar- 
ative analysis towards the definition of a 
minimal genome. Nowadays, a large num- 
ber of natural highly reduced genomes are 
available in the databases. The smallest 
genomes identified to date are those of 
Hodgkinia cicadicola, the P-endosymbiont 
of cicadas (144 kb in size, encoding
169 proteins) [53] and Tremblaya prin- 
ceps, the P-endosymbiont of mealybugs 
(139 kb, containing just 116 protein-coding 
genes) [54, 55]. However, these bacteria 
have established consortia with other en- 
dosymbionts, and their extremely reduced 
genomes are unable to carry on some 
of the most basic processes. In fact, T. 
princeps does not even possess a minimal 
functional machinery for DNA replication, 
transcription and translation, being close 
to what could be considered an organelle. 

The availability of massive sequencing
technologies during recent
years has allowed the performance of
high-throughput comparative approaches
to identify what has been called the
“extended bacterial core genome” [17]. By
comparing 573 sequenced genomes, 
Lapierre and Gogarten identified about 
250 genes which belong to gene families 
that are present in at least 99% of the 
sampled genomes, constitute around 8% 
of the genes present in a typical bacterial 
genome, and encode proteins involved in 
basic essential functions such as replica- 
tion, translation and energy homeostasis. 
As an extension of this approach, Huang 
and coworkers [18] compared the extended 
core genome obtained by a comparison of 
92 Gram-negative plus 93 Gram-positive 
bacteria with the M. genitalium genome, 
and identified a core genome composed 
of 151 genes (abbreviated as HCG,
Huang et al. Core Genome, from now
on, for the sake of simplicity). Once
more, the proposed minimal set included
mostly genes involved in informational 
processes. However, although the list 
presented by these authors can help in the 
identification of essential genes that were 
missing from previous studies, they did 
not take NOD into account. Additionally, 
the method used was quite conservative, 
as the minimal gene-set only included 
those genes that had a level of homology 
above a threshold, and the authors 
recognized that they may have missed 
essential genes. Furthermore, they did 
not check the final list in order to fill 
the gaps in informational and metabolic 
pathways; nevertheless, they argued that 
those genes identified by their approach 
must be truly indispensable. 
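
The idea of an extended core, in which a gene family need only be present in a large fraction of the sampled genomes rather than in all of them, can be sketched as follows. This is an illustrative simplification, not the procedure of the cited studies: it assumes that gene families have already been assigned, and the function name, the `min_fraction` parameter and the toy data are hypothetical.

```python
from collections import Counter


def extended_core(genomes, min_fraction=0.99):
    """Return gene families present in at least `min_fraction` of the genomes."""
    n_genomes = len(genomes)
    if n_genomes == 0:
        return set()
    # Count each family once per genome, regardless of copy number.
    counts = Counter(
        family for families in genomes.values() for family in set(families)
    )
    return {
        family
        for family, count in counts.items()
        if count / n_genomes >= min_fraction
    }


# Hypothetical toy data: four genomes and three gene families.
toy_families = {
    "g1": ["rpoB", "atpA", "nifH"],
    "g2": ["rpoB", "atpA"],
    "g3": ["rpoB", "atpA"],
    "g4": ["rpoB"],
}

if __name__ == "__main__":
    print(sorted(extended_core(toy_families, min_fraction=0.75)))
    print(sorted(extended_core(toy_families, min_fraction=1.0)))
```

Relaxing the threshold below 100% tolerates occasional losses or annotation gaps in individual genomes, at the cost of admitting families that some organisms evidently live without.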

In any case, the number of genes identi- 
fied by comparative genomics is artificially 
small for several reasons. One of the most 
relevant reasons is NOD (as noted above, 
attempts have been made to buffer its 
impact in comparative analyses), because 
many essential cellular functions can be 
performed by the products of alterna- 
tive non-orthologous genes. Furthermore, 
computational analyses are likely to un- 
derestimate the minimal gene-set because 
sequence divergence can be great between 
phylogenetically distant organisms and, 
consequently, the inclusion of genomes 
from very distant taxa in the comparison
can lead to the exclusion of true orthologs 
from the list of shared genes. Genes with 
a high evolutionary rate might also be 
missed. Additionally, as noted above, there 
is a negative correlation between the num- 
ber of genomes and the number of genes 
shared by them. As indicated by its name, 
a bacterial core genome (extended or not) 
is far from being the minimal gene-set 
needed for a cell to be considered alive. It 
is unrealistic to consider that a free-living
cell can be sustained by the universally
conserved genes only. 


2.2 
Experimental Genetics: Mutational 
Approaches to Detect Essential Genes 


Three different experimental approaches 
have been used to identify genes that 
are indispensable for cell viability under 
particular growth conditions in model or- 
ganisms: (i) the systematic inactivation of 
each individual gene present in a genome; 
(ii) massive transposon mutagenesis (the 
most widely used approach); and (iii) the 
use of antisense RNA to inhibit gene ex- 
pression (Table 1). Detailed information 
on each of these strategies is available in 
the Database of Essential Genes (DEG, 
http://www.essentialgene.org/) [56].

The early experimental approximations 
to the minimal genome were based on in- 
direct evidence from random mutagenesis 
or systematic gene inactivation. In fact, the 
first attempt was made before the genomic 
era, based on a study of viability performed 
on a limited number of randomly gener- 
ated gene disruptions in Bacillus subtilis 
[19], which led to an estimated num- 
ber of about 300 essential genes. Several 
genome-wide analyses using this strategy 
have since been conducted on Escherichia 
coli [21, 22], B. subtilis [20] and, more re- 
cently, Acinetobacter baylyi [23]. Although 
each of these studies has helped in the 
characterization of proteins of unknown 
functions, they all have some limitations. 
First, the inactivation of single genes fails 
to identify essential functions encoded by 
redundant genes, a frequently occurring 
situation that has accumulated in living 
systems throughout evolution. To minimize
this effect, Xu and coworkers performed
double-knockouts of paralogous
or isoenzyme genes in their genome-wide
analysis of essential genes in Streptococcus
sanguinis [24]. A second limitation relates
to the fact that the essential gene-set is not 
the same as the minimal genome, because 
genes that are individually unnecessary 
may not be simultaneously dispensable, 
and vice versa. An example of the first sit- 
uation is the possibility to use alternative 
pathways to synthesize a single metabo- 
lite. Although, initially, any gene in each 
pathway might be dispensable, once one 
of the pathways has been inactivated all 
of the genes involved in the alternative 
pathway become essential. The second sit- 
uation arises when two or more genes are 
necessary for a function that might not be 
essential, but the accumulation of prod- 
ucts generated by the action of any gene 
is damaging to the cell in the absence 
of another gene. Additionally, single-gene 
inactivation strategies are time-consuming 
and expensive, and more global systematic 
approaches have been developed. 
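
The first situation described above, in which genes that are individually dispensable cannot be removed simultaneously, can be made concrete with a toy viability model. The sketch assumes that a cell is viable when every required function retains at least one intact gene able to provide it; the gene and function names are hypothetical, and the reverse situation (toxic accumulation of intermediates) is not modeled.

```python
from itertools import combinations


def is_viable(genes_present, functions):
    """A cell is viable if each required function keeps at least one provider."""
    return all(providers & genes_present for providers in functions.values())


def essential_singly(genome, functions):
    """Genes whose individual deletion abolishes viability."""
    return {g for g in genome if not is_viable(genome - {g}, functions)}


def synthetic_lethal_pairs(genome, functions):
    """Pairs of individually dispensable genes that cannot both be deleted."""
    dispensable = sorted(genome - essential_singly(genome, functions))
    return [
        (a, b)
        for a, b in combinations(dispensable, 2)
        if not is_viable(genome - {a, b}, functions)
    ]


# Hypothetical toy model: one unique gene and two redundant alternatives.
toy_genome = {"polA", "pathA1", "pathB1"}
toy_functions = {
    "replication": {"polA"},
    "metabolite_X": {"pathA1", "pathB1"},
}

if __name__ == "__main__":
    print(sorted(essential_singly(toy_genome, toy_functions)))
    print(synthetic_lethal_pairs(toy_genome, toy_functions))
```

In this toy model a single-gene screen would report only the replication gene as essential, although a minimal genome must retain one of the two redundant genes as well. This is why the essential gene-set and the minimal genome differ.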

The mycoplasmas are a group of bac- 
teria that lack a cell wall and have the 
smallest genomes for an organism capable
of growth under laboratory conditions
in axenic culture, i.e., entirely free of
any other organisms. For these reasons,
they have long been considered one of 
the best experimental models to study 
essential genes; indeed, M. genitalium is 
the most widely used bacterium in com- 
parative genomics and experimental ap- 
proaches towards identifying the minimal 
genome [28]. The reduced genome of M. 
genitalium (ca. 580 kb) contains only 482 
protein-coding genes, and a low percent- 
age of paralogous genes (6%) compared to 
free-living bacteria (26% on average). M. 
genitalium was the first bacterium to be
analyzed by massive transposon mutagenesis,
leading to the identification of approximately
300 essential genes [27]. The basic
principle of this approach is that genes
which can be disrupted by transposons
must be nonessential, while those which
cannot be disrupted are candidates to be 
essential. Nevertheless, it was necessary to 
isolate and characterize pure clonal popu- 
lations to prove their dispensability, or not. 
Further experiments conducted by Glass 
and coworkers [28] showed that some es- 
sential genes could tolerate transposon 
insertions, while some nonessential genes 
might delay growth, leading in both cases 
to their incorrect classification. Following 
a careful review of their data, Glass and 
coworkers identified 382 essential genes 
in the M. genitalium genome (Minimal 
M. genitalium Genome; MMG from now 
on). It is worth noting that some of the 
genes included in the first hypothetical 
minimal genome obtained by comparative 
genomics [14] were disrupted in these ex- 
periments, which is an indication that they 
should not be considered essential. 

High-density transposon mutagenesis
has also been applied to the pathogenic
bacteria H. influenzae [29], Mycobacterium
tuberculosis [30], Pseudomonas aeruginosa 
[31, 32], Helicobacter pylori [33], Fran- 
cisella novicida [34], and Mycoplasma pul- 
monis [35], leading to varying numbers 
of essential genes (from more than 600 
to close to 300). More recently, mas- 
sive transposon mutagenesis coupled with 
high-throughput sequencing has been 
used to facilitate the identification of essential
genes [36-40]. The high density of insertions
across the selected genome (more
than a million transposon mutants can
be assayed in a single experiment) allows
a clearer demarcation between essential
and nonessential genes. Furthermore,
hypersaturation transposon mutagenesis
also allows the detection of essential
noncoding and regulatory elements [39].

A third systematic experimental strategy
to identify essential genes involves
the use of antisense RNA; this approach
has been used on Staphylococcus aureus 
[41, 42]. However, the use of antisense 
RNA is limited to those genes for which 
an adequate expression of the inhibitory 
RNA can be obtained. Allelic replacement 
mutagenesis — an alternative global strat- 
egy for single-gene inactivation — was used 
to identify essential genes that have been 
missed in previous antisense RNA exper- 
iments in S. aureus but were present in 
the core genome identified by comparative 
genomics [25, 26]. 
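
The basic principle of transposon-based screens (genes that tolerate insertions are nonessential, whereas genes without insertions are candidates to be essential) can be sketched as a simple counting exercise. The coordinates, gene names and the `min_hits` threshold below are hypothetical; real analyses require the confirmation steps discussed above, since some essential genes tolerate insertions and some slow-growing mutants are missed.

```python
from bisect import bisect_left, bisect_right


def classify_by_insertions(genes, insertion_sites, min_hits=1):
    """Label each gene by the number of transposon insertions inside it.

    genes: dict mapping gene name -> (start, end), inclusive coordinates.
    insertion_sites: iterable of genome positions of mapped insertions.
    """
    sites = sorted(insertion_sites)
    labels = {}
    for name, (start, end) in genes.items():
        # Number of insertion positions falling within [start, end].
        hits = bisect_right(sites, end) - bisect_left(sites, start)
        labels[name] = (
            "nonessential" if hits >= min_hits else "candidate essential"
        )
    return labels


# Hypothetical toy data: two genes and three mapped insertions.
toy_genes = {"geneA": (1, 100), "geneB": (101, 200)}
toy_sites = [150, 160, 250]

if __name__ == "__main__":
    for gene, label in classify_by_insertions(toy_genes, toy_sites).items():
        print(gene, label)
```

With sparse insertion libraries a short gene may escape disruption by chance alone, which is one reason why higher insertion densities give a clearer demarcation between the two classes.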

There is a wide variation in the num- 
ber of essential genes identified by the 
different studies, even between closely re- 
lated species or strains, highlighting the 
fact that the essential gene-set is heav- 
ily dependent on the specific biological 
niche. Another reason is that essentiality 
can be assayed in many different condi- 
tions [57], and in most cases the choice 
is forced by the type of organism that 
is being assayed, which presents differ- 
ent constraints due to its living style and 
its natural genome composition. Studies 
have been performed on different rich me- 
dia with complex mixtures of nutrients, or 
in defined minimal media. Some studies 
tested for colony formation, while others 
were performed on liquid cultures. Ad- 
ditionally, many of these studies did not 
intend to find universally essential genes 
to approach a minimal genome. Assays 
were designed to identify genes essen- 
tial for a particular process of interest, 
such as virulence [33-35] or adaptation 
to specific environments [58]. In some 
studies the aim was to generate a com- 
prehensive collection of mutants for use 
in experimental studies towards the char- 
acterization of genes of unknown function 
[32]. Others aimed to identify essential
genes as potential targets of new
antimicrobial drugs [34] to combat newly
emerging pathogens or antibiotic-resistant
bacteria [59], and focused their efforts
on proving experimentally the essentiality
of genes that are shared among
several fastidious pathogens but have
low sequence identity with eukaryotic
genes [60]. Finally, some studies were used
as a proof-of-principle of the effectiveness 
of the new methods that were being de- 
veloped for high-throughput analyses of 
mutants and, therefore, were not intended 
to be exhaustive [36, 61]. In spite of the 
variation, the identified genes in all of 
these approaches deserve a closer exam- 
ination by experimental means, because 
they can be useful to understand the ba- 
sic biological processes in which they are 
involved and the way in which these pro- 
cesses interact. 


2.3
Comparative Proteomics: Preserved versus 
Active Genes 


The advances made in transcriptomics and 
proteomics during recent years have high- 
lighted the fact that the characterization of 
protein-coding genes present in a genome 
is far from providing sufficient informa- 
tion to characterize a living cell. In fact, the 
correct genome annotation and the identi- 
fication of expressed protein-coding genes 
can benefit from the use of proteomic tech- 
niques and comparative proteogenomics 
[62]. Additionally, they will help to identify 
persistent genes — that is, genes that are 
conserved among species but are not es- 
sential — because such genes would proba- 
bly not be expressed and, therefore, would 
not appear in a proteomics study [44]. Sev- 
eral proteogenomics attempts have been 
made to determine the essentiality of 
genes proposed by comparative genomics 
as components of the core genome. Callister
and coworkers [43] performed a
comparative proteomics analysis on 17
environmental and pathogenic bacteria. 
The selected species lived in different 
environments, were phylogenetically dis- 
tant from each other, and were grown 
under different conditions. As a conse- 
quence, some proteins previously thought 
to be indispensable were excluded from 
the core proteome. However, as happens
with comparative genomic analyses,
the inclusion of evolutionarily distant
species in the comparison may impede
the identification of truly essential gene 
products. In trying to avoid this problem, 
Fisunov and coworkers [44] performed a 
proteogenomics analysis on three phylo- 
genetically close Mycoplasma species with 
small genomes, which can be grown un- 
der the same laboratory conditions even 
though they occupied different ecological 
niches. The aim was to identify the minimal
protein-set required for life, excluding
proteins involved in niche adaptation and
stress response, and to minimize the
problem of NOD of essential genes.
Genomic and proteomic cores are largely
coincident, and the proteomic core is in 
good agreement with the MMG [28]. Only 
26 proteins from the core proteome are 
dispensable according to that previous 
study. Although there are also several es- 
sential proteins that are absent from the 
proposed core proteome, most of these 
were small or membrane proteins that 
might remain elusive in proteome anal- 
ysis, or that were not simultaneously 
present in all three Mycoplasma species 
analyzed. 
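
The comparison described above (a core proteome shared by all species, checked against a genomic core and against experimentally essential genes) reduces to a few set operations. The following Python sketch illustrates this under stated assumptions: the species labels, the protein lists, and the essential set are all hypothetical and are not taken from the cited studies.

```python
# Hypothetical proteins detected in each of three species (illustrative only).
detected = {
    "species_A": {"dnaK", "rpoB", "tufA", "groEL", "ftsZ", "pgk"},
    "species_B": {"dnaK", "rpoB", "tufA", "groEL", "ftsZ", "ldh"},
    "species_C": {"dnaK", "rpoB", "tufA", "groEL", "pgk", "ldh"},
}
# Hypothetical core genome and hypothetical experimentally essential set.
genomic_core = {"dnaK", "rpoB", "tufA", "groEL", "ftsZ"}
essential = {"dnaK", "rpoB", "tufA", "ftsZ"}


def core_set(sets):
    """Return the items present in every one of the given sets."""
    sets = list(sets)
    return set.intersection(*sets) if sets else set()


# Proteins detected in all three species.
core_proteome = core_set(detected.values())
# Core proteins that the essentiality study found dispensable.
core_but_dispensable = core_proteome - essential
# Essential proteins that were not detected in every species.
essential_but_missing = essential - core_proteome
# Overlap between the proteomic and genomic cores.
shared_with_genomic_core = core_proteome & genomic_core
```

In this toy data set one core protein is dispensable and one essential protein is missing from the core proteome, which mirrors the two kinds of disagreement discussed in the text.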


2.4 
Minimal Cell Modeling 


Models provide a framework in which
to consider a system, specifically with
regard to the interactions within the
system and its response to outside perturbations.
A theoretical computer-based
model of a minimal cell can help in the 
understanding of biological systems. The 
massive acquisition of data in molecular 
and cellular biology has allowed the design 
of models for minimal cells, which will
serve as a framework to resolve ambiguities in the
interpretation of experimental and comparative
genomics and proteomics data [46, 47,
63]. Any proposed model must be explicit 
about minimal functions, and also include 
a realistic set of proteins and functional 
RNAs to accomplish these functions. How- 
ever, theoretical models to represent a 
whole cell have an important limiting 
factor: to date, no single computational 
method is sufficient to fully explain com- 
plex phenotypes in terms of molecular 
components and their interactions, and it 
is difficult to determine and model all of 
the parameters involved [46]. 

The first whole-cell model proposed 
was known as the E-CELL [45], and 
was based on a reduced version of the 
genome of M. genitalium, including only 
the 127 genes needed to carry out a minimal
cellular metabolism composed of the
pathways for glycolysis and phospholipid 
biosynthesis, transcription, and transla- 
tion. However, this virtual cell did not 
include modules to describe other im- 
portant nonmetabolic functions, such as 
chromosome replication or cell division. 
More recently, two whole-cell models have 
been proposed. One of these models is 
also based on the genome of M. geni- 
talium [46] and attempts to describe the 
bacterial life-cycle and predict a wide range 
of cellular behaviors, based on the func- 
tions of every annotated gene product. 
It integrates all such functions in 28
submodels, which are connected through
16 variables describing common
metabolites, RNA, proteins, and the chromosome.


This cell model is not minimal in essence, 
as there are genes in this genome that 
were proven to be nonessential. How- 
ever, the model has been validated by 
comparison of the predicted behavior with 
experimental data, and therefore can be 
used to detect which genes can be re- 
moved without compromising the cell 
integrity, growth, and reproduction. The 
Shuler group has formulated another min- 
imal cell model (MCM) with the min- 
imum number of genes necessary to 
grow and divide in an optimally sup- 
portive culture environment [47]. This 
model is based on a “coarse-grained” 
whole-cell model of E. coli previously
developed by the same group [64], and 
adopted the CMG proposed in 2004 by 
the present author’s group [7] with slight 
differences (as described in Sect. 3). The 
model focuses on essential functions, in- 
corporates mechanisms for most cellu- 
lar processes, and describes explicitly all 
genes. Although it is not chemically detailed
(because otherwise it would not be
computationally tractable), it has the advantage
of being functionally complete and
modular.
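
The modular design described above, in which separate submodels interact only through shared state variables, can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration: the three submodels, their rate constants, and the three state variables are invented and do not reproduce the structure or parameters of the cited whole-cell models.

```python
from dataclasses import dataclass


@dataclass
class CellState:
    """Shared variables through which the toy submodels communicate."""
    atp: float = 100.0
    protein: float = 10.0
    mass: float = 1.0


def metabolism(state, dt):
    # Toy rule: ATP is regenerated at a constant rate.
    state.atp += 20.0 * dt


def translation(state, dt):
    # Toy rule: protein synthesis costs 4 ATP per unit and is ATP-limited.
    made = min(2.0 * dt, state.atp / 4.0)
    state.protein += made
    state.atp -= 4.0 * made


def growth(state, dt):
    # Toy rule: cell mass is proportional to total protein.
    state.mass = state.protein / 10.0


SUBMODELS = [metabolism, translation, growth]


def simulate(state, steps, dt=1.0):
    """Run every submodel once per time step; they interact only via `state`."""
    for _ in range(steps):
        for submodel in SUBMODELS:
            submodel(state, dt)
    return state


final = simulate(CellState(), steps=10)
```

Because each submodel touches only the shared state, one of them can be replaced or removed without rewriting the others, which is the practical benefit of modularity that the text attributes to these models.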
All of the above-mentioned comparative
and experimental studies have shown
that, even though the essential gene-set
complement can vary among different
organisms and environmental conditions,
there is a set of functions universally
represented in any living form, including a
limited number of metabolic transactions
that are needed to survive in a nutrient-rich
and stable environment, even though
different genes can be recruited to perform
those functions. Therefore, the definition
of a minimal cell can be approached by
searching for the genes that are necessary
and sufficient to perform the well-defined
pathways needed to carry out such essential
functions.
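
The idea of a gene-set that is both necessary and sufficient for a list of required functions can be stated as a small check. The Python sketch below is only an illustration under stated assumptions: the gene names, the function labels, and the gene-to-function mapping are hypothetical.

```python
# Hypothetical mapping from genes to the functions they provide.
GENE_FUNCTIONS = {
    "geneA": {"replication"},
    "geneB": {"transcription"},
    "geneC": {"translation"},
    "geneD": {"translation"},  # a different gene recruited for the same function
}
# Hypothetical list of functions every cell must perform.
REQUIRED = {"replication", "transcription", "translation"}


def covered(genes):
    """Return the union of functions provided by the given genes."""
    functions = set()
    for gene in genes:
        functions |= GENE_FUNCTIONS[gene]
    return functions


def is_sufficient(genes):
    """True if the genes together provide every required function."""
    return REQUIRED <= covered(genes)


def is_minimal(genes):
    """True if the set is sufficient and no single gene can be removed."""
    genes = set(genes)
    return is_sufficient(genes) and all(
        not is_sufficient(genes - {gene}) for gene in genes
    )
```

With this mapping both `{"geneA", "geneB", "geneC"}` and `{"geneA", "geneB", "geneD"}` are minimal, which illustrates the point made in the text that the functions are universal even when the genes performing them differ.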

Any living system relies on three essen- 
tial biological pillars: genetic machinery; 
energetic metabolism; and an envelope 
that encloses both of these and controls 
the bidirectional flow of energy and matter
for the benefit of the system [24, 65].
Almost a decade ago, attempts were made 
to define the CMG by performing a com- 
prehensive analysis of all previously used 
strategies for addressing this subject. A 
careful review of genes involved in the 
primary metabolism was also made, in 
order to propose a simplified metabolic 
chart capable of providing energy and ba- 
sic components for a minimal living cell 
[7]. The catalog of functions that must be 
necessary and sufficient for a living cell 
was grouped into five main categories:
(i) information storage and processing; (ii) 
protein processing, folding, and secretion; 
(iii) cellular processes; (iv) energetic and 
intermediary metabolism; and (v) poorly 
characterized genes. The MCM model pro- 
posed by Shuler and coworkers [47] was 
based on a modified version of the CMG. 
In light of all new analyses that have 
been reviewed in the previous section,
especially the extended comparative anal- 
ysis performed by Huang and coworkers 
(HCG) [18], and the experimental studies 
of Glass and coworkers on M. genitalium 
(MMG) [28], a refined minimal gene-set 
machinery can be proposed (Tables 2-4; 
Figs 1 and 2). The reviewed gene-set ma- 
chinery proposed also incorporates RNA 
genes that are needed for correct cellu- 
lar functioning. All genes included have 
a known essential function, although in 
some cases it has been difficult to sharply 
define if a function can be performed
by the products of a minimal number
of genes or, instead, it would be more
appropriate to add some additional ones
to improve performance. Therefore, this
refined version contains a small number
of genes (18 coding for proteins and three
RNA genes) that might not be essential 
components of the minimal gene-set ma- 
chinery. Functional information has been 
mainly retrieved from the EcoCyc database 
(http://ecocyc.org/). The gene names are 
based on those of E. coli. The encoded 
enzymes are identified by their Enzyme 
Commission (EC) number. The distribu- 
tion of genes in the main categories is 
summarized in Fig. 1. 


3.1 

The First Pillar. Storage and Processing of 
Genetic Information: From Genes to Gene 
Products 


All different strategies used to define 
a minimal gene-set are in agreement, 
in that most genes which need to be 
included are devoted to DNA metabolism, 
transcription, and protein synthesis and 
processing (Table 2). 


3.1.1 DNA Metabolism 

A basic DNA replication system must 
be able to perform four basic steps: (i) 
recognition of the replication origin; (ii) 
initiation of the process by recruiting the 
replisomal proteins at the origin; (iii) du- 
plication of both DNA strands; and (iv) 
termination of the process and separation 
of the daughter DNA molecules. Several 
endosymbiotic bacteria do not possess any 
protein to recognize the replication origin, 
which indicates that such proteins are not essential
under some conditions. For this reason, 
no genes involved in this first step were 
included in the CMG [7]. However, Huang 
and coworkers [18] included DnaA in the
Tab. 2. Minimal gene-set machinery for storage and processing of genetic information.


GENE(a) E.C. number CMG HCG MMG Gene product
DNA metabolism
dnaA - - x x Chromosomal replication initiator protein DnaA
dnaB 3.6.1.- x x x Replicative DNA helicase
dnaE 2.7.7.7 x x x DNA polymerase III, α subunit
dnaG 2.7.7.- x x x DNA primase
dnaN 2.7.7.7 x x x DNA polymerase III, β subunit
dnaQ - x x DNA polymerase III, ε subunit
dnaX 2.7.7.7 x x DNA polymerase III, γ and τ subunits
gyrA 5.99.1.3 x x x DNA gyrase, A subunit
gyrB 5.99.1.3 x x x DNA gyrase, B subunit
holA 2.7.7.7 x - - DNA polymerase III, δ subunit
holB 2.7.7.7 x x DNA polymerase III, δ′ subunit
hupA - x x DNA-binding protein
lig 6.5.1.2 x x x DNA ligase (NAD-dependent)
ssb x x x Single-strand DNA-binding protein


DNA repair
nth 4.2.99.18 x - Endonuclease III
polA 3.1.11.- x x x 5'-3' Exonuclease domain of DNA polymerase I
ung 3.2.2.- x Uracil-DNA glycosylase
Basic transcription machinery
deaD - x x x ATP-dependent RNA helicase
greA - x x x Transcription elongation factor
nusA - x x x Transcription translation coupling
nusG - x x x Transcription antitermination protein
rpoA 2.7.7.6 x x x RNA polymerase, α subunit
rpoB 2.7.7.6 x x x RNA polymerase, β subunit
rpoC 2.7.7.6 x x x RNA polymerase, β′ subunit
rpoD - x x x RNA polymerase major σ factor


Aminoacyl-tRNA synthesis
alaS 6.1.1.7 x x x Alanyl-tRNA synthase
argS 6.1.1.19 x x x Arginyl-tRNA synthase
asnS 6.1.1.22 x x x Asparaginyl-tRNA synthase
aspS 6.1.1.12 x x x Aspartyl-tRNA synthase
cysS 6.1.1.16 x x x Cysteinyl-tRNA synthase
glnS 6.1.1.18 x - - Glutaminyl-tRNA synthase
gltX 6.1.1.17 x x x Glutamyl-tRNA synthase
glyS 6.1.1.14 x - x Glycyl-tRNA synthase, β subunit
hisS 6.1.1.21 x x x Histidyl-tRNA synthase
ileS 6.1.1.5 x x x Isoleucyl-tRNA synthase
leuS 6.1.1.4 x x x Leucyl-tRNA synthase
lysS 6.1.1.6 x x x Lysyl-tRNA synthase
metS 6.1.1.10 x x x Methionyl-tRNA synthase
pheS 6.1.1.20 x x x Phenylalanyl-tRNA synthase, α subunit
pheT 6.1.1.20 x x x Phenylalanyl-tRNA synthase, β subunit
proS 6.1.1.15 x - x Prolyl-tRNA synthase
serS 6.1.1.11 x x x Seryl-tRNA synthase
thrS 6.1.1.3 x - x Threonyl-tRNA synthase
trpS 6.1.1.2 x x x Tryptophanyl-tRNA synthase
tyrS 6.1.1.1 x x x Tyrosyl-tRNA synthase
valS 6.1.1.9 x x x Valyl-tRNA synthase

tRNA genes
tRNA-Ala(GGC) tRNA-Ala, anticodon GGC
tRNA-Ala(UGC) tRNA-Ala, anticodon UGC
tRNA-Arg(ACG) tRNA-Arg, anticodon ACG
tRNA-Arg(CCG) tRNA-Arg, anticodon CCG
tRNA-Arg(UCU) tRNA-Arg, anticodon UCU
tRNA-Asn(GUU) tRNA-Asn, anticodon GUU
tRNA-Asp(GUC) tRNA-Asp, anticodon GUC
tRNA-Cys(GCA) tRNA-Cys, anticodon GCA
tRNA-Gln(UUG) tRNA-Gln, anticodon UUG
tRNA-Glu(UUC) tRNA-Glu, anticodon UUC
tRNA-Gly(GCC) tRNA-Gly, anticodon GCC
tRNA-Gly(UCC) tRNA-Gly, anticodon UCC
tRNA-His(GUG) tRNA-His, anticodon GUG
tRNA-Ile(GAU) tRNA-Ile, anticodon GAU
tRNA-Ile(CAU) tRNA-Ile, anticodon CAU
tRNA-Leu(GAG) tRNA-Leu, anticodon GAG
tRNA-Leu(UAG) tRNA-Leu, anticodon UAG
tRNA-Leu(UAA) tRNA-Leu, anticodon UAA
tRNA-Lys(UUU) tRNA-Lys, anticodon UUU
tRNA-Met(CAU) tRNA-Met, anticodon CAU
tRNA-Phe(GAA) tRNA-Phe, anticodon GAA
tRNA-Pro(UGG) tRNA-Pro, anticodon UGG
tRNA-Pro(GGG) tRNA-Pro, anticodon GGG
tRNA-Ser(GGA) tRNA-Ser, anticodon GGA
tRNA-Ser(UGA) tRNA-Ser, anticodon UGA
tRNA-Ser(GCU) tRNA-Ser, anticodon GCU
tRNA-Thr(GGU) tRNA-Thr, anticodon GGU
tRNA-Thr(UGU) tRNA-Thr, anticodon UGU
tRNA-Trp(CCA) tRNA-Trp, anticodon CCA
tRNA-Tyr(GUA) tRNA-Tyr, anticodon GUA
tRNA-Val(GAC) tRNA-Val, anticodon GAC
tRNA-Val(UAC) tRNA-Val, anticodon UAC
tRNA maturation and modification
iscS 4.4.1.- x - Cysteine desulfurase-NifS homolog
mnmA 2.1.1.61 x x tRNA (5-methylaminomethyl-2-thiouridylate) methyltransferase
mnmE 3.6.-.- x x - GTP-binding protein involved in biosynthesis of 5-methylaminomethyl-2-thiouridine
mnmG - x - x Glucose inhibited division protein A, involved in biosynthesis of 5-methylaminomethyl-2-thiouridine
pth 3.1.1.29 x Peptidyl-tRNA hydrolase
rnpA 3.1.26.5 - Protein component of ribonuclease P
rnpB - - - - RNA component of ribonuclease P
tilS (mesJ) 6.3.4.19 x x x tRNA(Ile)-lysidine synthetase
trmD 2.1.1.228 - x x tRNA (guanine-N(1)-)-methyltransferase
Ribosomal components
rrfA - - - 5S ribosomal RNA
rrsA - - - 16S ribosomal RNA
rrlA - - - 23S ribosomal RNA
rplA x x x 50S ribosomal protein L1
rplB x x x 50S ribosomal protein L2
rplC x x x 50S ribosomal protein L3
rplD x x x 50S ribosomal protein L4
rplE x x x 50S ribosomal protein L5
rplF x x x 50S ribosomal protein L6
rplI x - x 50S ribosomal protein L9
rplJ x x x 50S ribosomal protein L10
rplK x x 50S ribosomal protein L11
rplL x x 50S ribosomal protein L12
rplM x x 50S ribosomal protein L13
rplN x x x 50S ribosomal protein L14
rplO x - x 50S ribosomal protein L15
rplP x x x 50S ribosomal protein L16
rplQ x x x 50S ribosomal protein L17
rplR x x x 50S ribosomal protein L18
rplS x x x 50S ribosomal protein L19
rplT x x x 50S ribosomal protein L20
rplU x x 50S ribosomal protein L21
rplV x x x 50S ribosomal protein L22
rplW x - x 50S ribosomal protein L23
rplX x x 50S ribosomal protein L24
rpmA x x 50S ribosomal protein L27
rpmB x - x 50S ribosomal protein L28
rpmC x - x 50S ribosomal protein L29
rpmE x x x 50S ribosomal protein L31
rpmF x - x 50S ribosomal protein L32
rpmG x - x 50S ribosomal protein L33
rpmH x - x 50S ribosomal protein L34
rpmI x - x 50S ribosomal protein L35
rpmJ x - x 50S ribosomal protein L36
rpsB x x x 30S ribosomal protein S2
rpsC x x x 30S ribosomal protein S3
rpsD x x x 30S ribosomal protein S4
rpsE x x x 30S ribosomal protein S5
rpsF x - x 30S ribosomal protein S6
rpsG x x x 30S ribosomal protein S7
rpsH x x x 30S ribosomal protein S8
rpsI x x x 30S ribosomal protein S9
rpsJ x x x 30S ribosomal protein S10
rpsK x x x 30S ribosomal protein S11
rpsL x x x 30S ribosomal protein S12
rpsM x x x 30S ribosomal protein S13
rpsN x - x 30S ribosomal protein S14
rpsO x x x 30S ribosomal protein S15
rpsP x x x 30S ribosomal protein S16
rpsQ x x x 30S ribosomal protein S17
rpsR x x x 30S ribosomal protein S18
rpsS x x x 30S ribosomal protein S19
rpsT x - x 30S ribosomal protein S20
Ribosome function, maturation and modification
cspR 2.1.1.- x - - Ribosomal methyltransferase
engA - x x x GTP-binding protein
era - x - x GTP-binding protein
ksgA 2.1.1.- x x - Dimethyladenosine transferase
mraW 2.1.1.- x x x 16S rRNA methyltransferase
obg - x x x GTP-binding protein
rbfA - x - x Ribosome-binding factor A
rsmI (yraL) - x x - 16S rRNA methyltransferase
- x - x Conserved protein involved in translation
ychF - x x - GTP-binding protein
Translation factors
efp - x - x Elongation factor P
fusA 3.6.1.48 x x Elongation factor G
frr - x x Ribosome recycling factor
hemK 2.1.1.- x - x N5-glutamine methyltransferase, modulation of release factors activity
infA - x x Initiation factor IF-1
infB - x Initiation factor IF-2
infC - x x x Initiation factor IF-3
lepA - x x - GTP-binding elongation factor
prfA - x x Peptide chain release factor 1 (RF1)
smpB - x x tmRNA-binding protein
ssrA tmRNA
tsf - x Elongation factor Ts
tufA 3.6.5.3 x Elongation factor Tu
RNA degradation
pnp 2.7.7.8 x - - Polyribonucleotide nucleotidyltransferase
rnc 3.1.26.3 x x - Ribonuclease III

Protein processing, folding, and secretion
Protein post-translational modification
map 3.4.11.18 Methionine aminopeptidase
pepA 3.4.11.1 Aminopeptidase A/I
x - x
Protein folding
dnaJ x x x Hsp70 co-chaperone
dnaK x x x Chaperone Hsp70
groEL x x x Class I heat-shock protein
groES x - x Class I heat-shock protein
grpE x x x Hsp70 co-chaperone
Protein translocation and secretion
ffh x x x Protein component of signal recognition particle (SRP)
ffs - - - 4.5S RNA, RNA component of SRP
ftsY x x x SRP receptor
lepB - - Signal peptidase I
secA x x x Preprotein translocase subunit (ATPase)
secE x - x Membrane-embedded preprotein translocase subunit
secY x x x Membrane-embedded preprotein translocase subunit
Protein turnover
gcp 3.4.24.57 x x x Probable O-sialoglycoprotein endopeptidase
hflB 3.4.24.- x x x ATP-dependent protease
lon 3.4.21.53 x x x ATP-dependent protease La

(a) Yellow cells contain genes that might not be essential. RNA genes are depicted in blue.


HCG as the gene encoding the initiation
protein, as it is present in all genomes that
they examined. It is possible that small endosymbiotic
genomes encode unidentified
proteins to perform this function or that,
given their host-dependent lifestyle, their
hosts somehow control this process.
Therefore, it is a matter of debate whether
this gene will be needed in a minimal cellular
machinery. Leaving that aside, the minimal
DNA replicative machinery must be
composed of at least one helicase (DnaB,
EC 3.6.1.-), which attracts the primase
(DnaG, EC 2.7.7.-) to the replication fork, to
start replication by the action of a DNA
polymerase (EC 2.7.7.7). In present-day
bacteria, the complex DNA polymerase III
holoenzyme is in charge of this process,
although some of its subunits seem to
be dispensable. In most endosymbiotic
bacteria with reduced genomes, the preinitiation
complex, which is needed to load
the β processivity clamp (DnaN) at the
correct place to start replication, is composed
of DnaX, HolA, and HolB (τ, δ,
and δ′ subunits, respectively). However,
holA is dispensable in M. genitalium [28]
and it is not included in the HCG [18].
In the next step, the core polymerase,
composed of DnaE and DnaQ (α and ε
subunits, respectively), binds to the processivity
clamp to perform replication. The
process also requires the action of DNA ligase
(Lig, EC 6.5.1.2) and gyrase (GyrA and
Tab. 3. Minimal gene-set machinery for energetic and intermediary metabolism.


GENE(a) E.C. number CMG HCG MMG Gene product
eno 4.2.1.11 x x x Enolase
fbaA 4.1.2.13 x - x Fructose-1,6-bisphosphate aldolase
gapA 1.2.1.12 x x x Glyceraldehyde 3-phosphate dehydrogenase
gpmA 5.4.2.1 x - x Phosphoglycerate mutase
ldh 1.1.1.27 x - x L-lactate dehydrogenase
pfkA 2.7.1.11 x - x 6-phosphofructokinase
pgi 5.3.1.9 x x Glucose-6-phosphate isomerase
pgk 2.7.2.3 x x Phosphoglycerate kinase
pykA 2.7.1.40 x - x Pyruvate kinase
tpiA 5.3.1.1 x x x Triose phosphate isomerase
Pentose phosphate pathway
sph 3.1.3.37 - - - Sedoheptulose-1,7-bisphosphatase
rpe 5.1.3.1 x x Ribulose-phosphate 3-epimerase
rpiA 5.3.1.6 x - x Ribose 5-phosphate isomerase
tkt 2.2.1.1 x x - Transketolase
Lipid metabolism
cdsA 2.7.7.41 x - - Phosphatidate cytidylyltransferase
fadD 6.2.1.3 x - - Acyl-CoA synthase
gpsA 1.1.1.94 x - - sn-glycerol-3-phosphate dehydrogenase
plsB 2.3.1.15 x - - sn-glycerol-3-phosphate acyltransferase
plsC 2.3.1.51 x - x 1-acyl-sn-glycerol-3-phosphate acyltransferase
psd 4.1.1.65 x - - Phosphatidylserine decarboxylase
pssA 2.7.8.8 x - - Phosphatidylserine synthase
Biosynthesis of nucleotides
adk 2.7.4.3 x x x Adenylate kinase
dcd 3.5.4.13 x - - dCTP deaminase
gmk(b) 2.7.4.8 x x x Guanylate kinase
hpt x - x Hypoxanthine phosphoribosyltransferase
ndk(c) 2.7.4.6 x - - Nucleoside diphosphate kinase
nrdE 1.17.4.1 x x x Ribonucleoside-diphosphate reductase (major subunit)
nrdF 1.17.4.1 x - x Ribonucleoside-diphosphate reductase (minor subunit)
ppa 3.6.1.1 x - x Inorganic pyrophosphatase
prsA 2.7.6.1 x x x Phosphoribosylpyrophosphate synthase
pyrG 6.3.4.2 x - - CTP synthase
thyA 2.1.1.45 x - - Thymidylate synthase
tmk(b) 2.7.4.9 x - x Thymidylate kinase
trxA x x x Thioredoxin
trxB 1.8.1.9 x x x Thioredoxin reductase
upp 2.4.2.9 x - x Uracil phosphoribosyltransferase
Biosynthesis of cofactors
coaA x - - Pantothenate kinase
coaD x - - 4'-phospho-pantetheine adenylyltransferase
coaE x - - Dephosphocoenzyme A kinase
dfp x - - Phosphopantothenate cysteine ligase, 4'-phospho-pantothenyl-L-cysteine decarboxylase
folA 1.5.1.3 x - x Dihydrofolate reductase
glyA 2.1.2.1 x x x Glycine hydroxymethyltransferase
metK x x x Methionine adenosyltransferase
nadR 2.7.7.1 x - - Adenylyl transferase
nadV x - - Nicotinamide phosphoribosyltransferase
pdxY x - - Pyridoxal kinase
ribF x x x Riboflavin kinase, FMN adenylyltransferase
yloS x - - Thiamine pyrophosphokinase

(a) Yellow cells contain genes that might not be essential.
(b) A single diphosphate kinase, encoded by a modified adk gene, might be sufficient to phosphorylate all nucleotide monophosphates.
(c) Ndk could be substituted by a modified PykA to phosphorylate all nucleotide diphosphates, in addition to its role in glycolysis.
Tab. 4. Minimal gene-set machinery for cell envelope.


GENE E.C. number CMG HCG MMG Gene product
Cell division
ftsZ x x x cytoskeletal cell division protein
secA x x x preprotein translocase subunit (ATPase)
secE x x membrane-embedded preprotein translocase subunit
secY x x x membrane-embedded preprotein translocase subunit
Proton-motive force generation
atpA 3.6.3.14 x x x ATP synthase α chain
atpB 3.6.3.14 x x ATP synthase A chain
atpC 3.6.3.14 x x ATP synthase ε chain
atpD 3.6.3.14 x x x ATP synthase β chain
atpE 3.6.3.14 x x ATP synthase C chain
atpF 3.6.3.14 x x ATP synthase B chain
atpG 3.6.3.14 x x x ATP synthase γ chain
atpH 3.6.3.14 x x ATP synthase δ chain
yidC x x essential for proper integration of ATPase into the membrane
Transport
fadD* 6.2.1.3 x acyl-CoA synthase
pitA x low-affinity inorganic phosphate transporter
ptsG 2.7.1.69 x x phosphotransferase system (PTS) glucose-specific enzyme II
ptsH x x histidine-containing phosphocarrier protein of the phosphotransferase system (PTS)
ptsI x x x phosphotransferase system (PTS) enzyme I

* The genes that appear in grey have also been included in other categories, but are indicated on the table because their protein products are located in the membrane.


GyrB, EC 5.99.1.3), which can also act as
a topoisomerase to help in separating the 
daughter DNA molecules. Although topA 
was also included in the HCG [18], it is ab- 
sent in several endosymbiont genomes se- 
quenced so far, and it is known that a mod- 
ified gyrase can complement the lack of 
topA [66, 67]. Additionally, single-stranded 
DNA-binding protein (SSB) is needed to 
stabilize single-strand DNA during DNA 
replication, and at least one histone-like 
protein would be necessary to preserve 
the nucleoid organization [68]. HupA was 
suggested for this purpose in the CMG, as 
it is present in many analyzed endosym- 
biotic bacteria with reduced genomes 
and in M. genitalium [7]. Finally, MraW, 
which was originally included in the CMG 
as a conserved but poorly characterized 
essential methyltransferase, was also in- 
cluded in the MCM model [47] as the 
enzyme responsible for DNA methylation, 


which is an essential step for the initiation 
of DNA replication in E. coli. MraW has 
recently been characterized as being re- 
sponsible for the methyl modification of 
16S rRNA at the P site of the ribosome [69]. 
Therefore, whether it is needed for DNA 
or 16S rRNA methylation, it appears to be
an essential component of the minimal 
gene-set machinery. 

DNA repair and recombination can be 
considered an accessory mechanism for 
a minimal cell living in a highly protected 
environment. However, considering that 
single strand breaks are common during 
DNA replication, it seems appropriate 
to include in the minimal gene-set 
machinery at least genes encoding 
polymerase I (polA), an endonuclease 
to repair apurinic or apyrimidinic sites 
(nth), and uracil-DNA glycosylase (ung).
These three genes have been preserved
in most naturally reduced genomes, even
though they have lost most of the repair
machinery. The presence of these genes 
would allow the cell to maintain the rate 
of mutation at tolerable levels for viability 
[47]. Although Huang and coworkers
[18] found that the three components of
the nucleotide excision repair complex
(uvrA, uvrB, and uvrC), and two genes
involved in DNA recombination (recA
and ruvB) are also present in all their
compared genomes, all of them are absent
in endosymbionts with reduced genomes
and, therefore, they do not appear to be
essential for minimal cell survival.


Fig. 1. Gene distribution of the minimal gene-set machinery in the main functional categories. Recoverable chart labels: Cell envelope; DNA metabolism, 17; RNA metabolism, 148; Protein processing, folding, and secretion, 17.


3.1.2 RNA Metabolism 

Fig. 2. Minimal cell machinery. The names
in the boxes indicate the genes responsible
for a given reaction. The proposed minimal
cell would obtain from the environment the
basic components (in blue): glucose as a carbon
and energy source and for carbohydrate
metabolism (yellow boxes); fatty acids for
phospholipid biosynthesis (green boxes); three
nitrogenous bases (adenine, guanine, uracil)
for nucleotide biosynthesis (red boxes); amino
acids for protein biosynthesis (metabolic pathway
not indicated, as it is part of the genetic
information processing), and vitamins (folate,
nicotinamide, pantothenate, pyridoxal,
riboflavin, and thiamine) to synthesize all cofactors
needed for correct enzyme functioning
(blue boxes). The use and generation of ATP
and reductive power (NADH) are indicated.
The essential genes associated with storage
and processing of genetic information are not
indicated, for the sake of simplicity. Genes followed
by an asterisk could be substituted by
other essential genes (a modified Adk can act
instead of Gmk and Tmk, while PykA can act
instead of Ndk; see Sect. 3.2 for details). Abbreviations
(besides the accepted symbols):
CDP-DAG, CDP-diacylglycerol; DHAP, dihydroxyacetone
phosphate; DHF, dihydrofolate;
M-THF, methylene-tetrahydrofolate; PEP, phosphoenolpyruvate;
PP, pyridoxal phosphate;
PRPP, phosphoribosyl pyrophosphate; SAM,
S-adenosylmethionine; THF, tetrahydrofolate.
Stars indicate the use of energy in the form
of ATP (red star), CTP (orange star), or GTP
(pink star).


RNA metabolism is the most evolutionarily
conserved part of modern cell physiology.
It involves the synthesis, processing
and modification of transcripts, translation,
RNA degradation, and regulation.
In this proposal of a minimal gene-set
machinery, it would be carried out by
an ensemble of between 98 and 111
protein-coding genes and 34 to 37 RNA
genes (Table 2).
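
The gene counts quoted for RNA metabolism can be tallied in a few lines. The Python sketch below uses only the two ranges given in the text. Reading the gap between the lower and upper bounds as the genes that might not be essential is an assumption made here, not a statement from the source.

```python
# Ranges quoted in the text for RNA metabolism (lower bound, upper bound).
protein_coding = (98, 111)
rna_genes = (34, 37)

# Total number of genes at each bound.
total = (protein_coding[0] + rna_genes[0], protein_coding[1] + rna_genes[1])

# Assumption: the width of the range counts the possibly non-essential genes.
optional_protein = protein_coding[1] - protein_coding[0]
optional_rna = rna_genes[1] - rna_genes[0]
```

The upper bound obtained this way appears to match the count of 148 shown for RNA metabolism in Fig. 1.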

The basic transcription machinery must 
contain a core RNA polymerase (EC 
2.7.7.6; rpoA, rpoB, and rpoC) plus, at 
least, one general sigma factor (rpoD), as 
well as transcription elongation (greA and 
nusA) and termination (nusG) factors. The 
classical regulatory mechanism for gene 
expression in bacteria involves the con- 
trol of transcription initiation by sigma 
and specific transcription factors. How- 
ever, most bacteria with reduced genomes 
have lost most of these, and therefore no 
genes involved in this function are usually 
incorporated in minimal gene-sets. The 
inclusion of a single sigma factor also im- 
plies that all gene promoters must have 
binding sites for such a factor. The gene 
yqgF, which was included as an essential
but poorly characterized gene in the
CMG, seems to be involved in transcrip- 
tion anti-termination at Rho-dependent 
terminators [70]. However, as no transcrip- 
tion anti-termination factors have been 
included in the minimal gene-set ma- 
chinery, it is not present in all genomes 
analyzed by Huang and collaborators [18], 
and it is not essential in M. genitalium [28], 
it has been removed from this reviewed 
version. 

The basic translation machinery is, by 
far, the most complex part of a modern 
minimal cell. It must include a com- 
plete set of aminoacyl-tRNA synthetases 
(EC 6.1.1.-) and tRNAs, all essential ri- 
bosomal elements (rRNAs and proteins), 
translation factors, plus a set of proteins 
involved in the maturation of the different 
components. 


Not all aminoacyl-tRNA synthetases are 
identified as essential by comparative ge- 
nomics approaches, due to the divergence 
of some of them in the analyzed genomes 
[18]. However, as all of them are needed for 
protein synthesis, the complete set can be 
considered essential. The gene ycfF (also 
called hinT), included in the CMG as a 
conserved poorly characterized gene, has 
been identified as a potential regulator of 
lysyl-tRNA synthase and possibly of other 
aminoacyl-tRNA synthases [71]. However, 
as no regulators have been added to the 
minimal gene-set machinery, it has been 
excluded from this reviewed version. 

Formylation of the initiator methionyl- 
tRNA was previously considered essential 
for the initiation of protein synthesis in 
bacteria. However, it has been shown 
that P. aeruginosa can perform protein 
synthesis in the absence of this activity 
[72], and mutants simultaneously lacking 
formylation and deformylation enzymes 
have been obtained in several species (for 
a review, see Ref. [7]). Therefore, these 
genes have not been included in the 
minimal gene-set machinery, even though 
the former was identified in all species 
analyzed by Huang and coworkers [18]. 

Minimal gene-sets usually do not take 
into account RNA genes; however, it is 
obvious that rRNAs and tRNAs are essen- 
tial parts of a minimal translatome. The 
definition of the minimal set of rRNA
genes is straightforward, because all bac-
teria possess at least one copy of the genes
for 16S, 23S, and 5S RNAs. However, it 
is less simple to define a minimal set of 
tRNAs needed to decode the 61 possible 
sense codons. Additionally, modern cells 
devote a significant amount of resources 
to post-transcriptional rRNA and (mostly) 
tRNA modifications. A global survey of 
tRNA modification enzymes [73] revealed 
that the functional constraints which lead
to the different modifications are often
conserved, but the adopted solutions differ 
greatly. Moreover, it is difficult to predict 
consistently the presence or absence of a 
specific modification based on the pres- 
ence or absence of a homolog of an experi- 
mentally characterized tRNA-modification 
gene because, depending on the organ- 
ism, different enzymes can perform the 
same modification, while members of the 
same protein family can catalyze differ- 
ent reactions. Furthermore, even though 
tRNA modifications affect decoding and 
tRNA structure and stability, only a few 
are essential. It has been proposed that 
as few as 32 tRNAs might be sufficient to 
translate the entire genetic code accurately, 
provided that a set of 12 modifying wobble 
activities are also encoded in the genome 
[74]. These are the tRNAs that have been 
included in the minimal gene-set machin- 
ery. Only 29 of such tRNAs are present 
in the small genome of M. capricolum, 
although the authors consider that, due 
to the modified genetic code used by my- 
coplasmas, its minimal tRNA machinery 
might not be compatible with a canon- 
ical translation apparatus. Nevertheless, 
comparative analyses of five B. aphidicola 
strains with different degrees of genome 
reduction have revealed that also in this 
case 29 tRNA genes are conserved among 
them [75]. Therefore, the other three tR- 
NAs proposed in Forster and Church’s list 
[74] are labeled as putatively nonessen-
tial in Table 2. With regard to genes
involved in tRNA maturation and modi- 
fication, only nine of them have been pre- 
served in naturally reduced genomes that 
still encode a whole translatome. RNase
P, a ribonucleoprotein composed of the RnpA
protein plus the catalytic RnpB RNA, is
needed for the processing of tRNA pre- 
cursor molecules. iscS encodes a cysteine 
desulfurase (EC 4.4.1.-), and is involved in
2-thiouridine biosynthesis, a modification
that stabilizes anticodon structure, confers 
ribosome binding ability to tRNA, and im- 
proves open reading frame maintenance. 
MnmA (EC 2.1.1.61), MnmE (EC 3.6.-.-), 
and MnmG are involved in the biosyn-
thesis of the hypermodified nucleoside
5-methylaminomethyl-2-thiouridine, which
is found in the wobble position of some 
tRNAs. trmD, the gene encoding a G37 
methyltransferase (EC 2.1.1.228), was ini- 
tially not included in the minimal gene set; 
however, it appears that it should be added 
to the list, based on a careful review of its 
presence in natural reduced genomes, its 
preservation in all genomes examined by 
Huang and coworkers [18], and its identifi- 
cation as an essential gene in M. genitalium 
[28]. tilS (annotated as mesJ, a poorly char-
acterized gene considered essential in the
CMG) has been added to this category,
because it encodes the tRNA(Ile)-lysidine
synthetase (EC 6.3.4.19) responsible for
modifying the wobble base of the CAU
anticodon of tRNA(Ile) [76]. pth encodes
a peptidyl-tRNA hydrolase (EC 3.1.1.29), 
involved in recycling peptidyl-tRNAs re- 
leased from stalled ribosomes. This list of 
genes might not be enough to produce a 
set of mature and functional tRNAs in vivo. 
The bacteria with highly reduced genomes 
used in the comparative analyses that led 
to the selection of these genes have AT-rich 
genomes and present a strong codon bias, 
and their minimal translatome might not 
be universally efficient. However, it does 
not seem appropriate to add more genes
arbitrarily to this list, given that
even in the widely studied E. coli, some 
tRNA modification genes remain uniden- 
tified [73] and that, as mentioned above, 
different modifications can take place de- 
pending on the specific tRNA structure in 
a given organism. Experimental in vitro 
analyses should be performed in order to
define the correct composition of the mini-
mal gene-set of tRNAs and their modifying
enzymes. 
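
The codon and tRNA figures quoted in this paragraph can be checked with a short enumeration. The following Python sketch is illustrative only: it assumes the standard genetic code (three stop codons), takes the values 32 and 29 directly from the text, and uses hypothetical variable names.

```python
from itertools import product

# Assumption: standard genetic code, written in RNA bases.
BASES = "UCAG"
STOP_CODONS = {"UAA", "UAG", "UGA"}

# Enumerate all 4^3 triplets and discard the stop codons.
all_codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
sense_codons = [codon for codon in all_codons if codon not in STOP_CODONS]

# Figures taken from the text.
PROPOSED_TRNAS = 32   # Forster and Church's proposal [74]
CONSERVED_TRNAS = 29  # conserved among the B. aphidicola strains [75]

# tRNAs from the proposed list that are labeled putatively nonessential.
putatively_nonessential = PROPOSED_TRNAS - CONSERVED_TRNAS

print(len(all_codons), len(sense_codons), putatively_nonessential)
```

With 61 sense codons and at most 32 tRNAs, each tRNA must on average read about two codons, which is why the wobble-modifying activities are part of the proposal.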

Only 36 of the 54 ribosomal proteins 
found in E. coli were included in the HCG 
[18]. However, these are short proteins 
and, as mentioned above, the identifica- 
tion of distant orthologs might be diffi- 
cult in comparative approaches. In 2005, 
Mushegian reconstructed a protein com- 
plement of the putative minimal ribosome 
composed of 38 proteins, using a com- 
bined comparative analysis of sequences, 
structures, and phyletic patterns from all 
three domains of life, and taking into 
account NOD [77]. However, the function-
ality of ribosomes composed only of these
proteins has not been proven, and the pro- 
posed gene-set is not coincident with the 
HCG [18]. Therefore, and similarly to the 
CMG [7], the decision was taken to in- 
clude in the minimal gene-set machinery 
the same 50 ribosomal proteins identi-
fied in M. genitalium (although those that
do not appear in Mushegian’s minimal
protein complement have been labeled
as putatively nonessential in Table 2). It is
known that some ribosomal proteins are 
post-translationally modified, which might 
be relevant for correct ribosomal structure 
and/or function [78]. However, only a lack 
of methylation of the ribosomal protein L3 
leads to a differential phenotype, and it is 
not lethal. Therefore, none of the genes 
involved in ribosomal protein modifica-
tions have been included in the minimal
gene-set machinery. 

Although rRNA maturation and ribo-
some assembly are complex processes, not
all modifications that take place at some 
stage of the process are essential [78], 
which makes it difficult to identify the es- 
sential enzymes required. First of all, the 
16S, 23S, and 5S rRNAs are usually synthe-
sized as one primary transcript that must
be processed by RNase III, encoded by rnc.
No other RNases involved in rRNA matu- 
ration have been identified as essential by 
comparative genomics. Once the rRNAs 
have been processed, they undergo chem- 
ical modifications. Although it has been 
possible to assemble 30S subunits with in 
vitro-transcribed 16S rRNA without modi- 
fications, these particles show a decreased 
tRNA-binding capacity, and ribosome as-
sembly occurs very slowly and requires
nonphysiological conditions [79, 80]. The 
CMG included three 16S rRNA methyl- 
transferases [7]. KsgA was chosen because 
it was present in all reduced genomes an- 
alyzed. It is also present in the HCG [18], 
although it has proven to be dispensable 
in M. genitalium [28] and might there- 
fore not be essential. Two additional 16S 
rRNA methyltransferases, RsmI (previ-
ously called YraL) and MraW (which is
involved in fine-tuning of the ribosomal
decoding center [69]), were included in
the CMG because they are widely dis- 
tributed among highly reduced genomes, 
although at that time their function had 
not been characterized. Additionally, rbfA 
was also included in the CMG and was 
proven to be essential in M. genitalium 
[28]. It encodes a protein essential for 
the efficient processing of 16S rRNA, al- 
though its function is still unknown. On 
the other hand, the large ribosomal sub- 
unit also seems to depend on chemical 
modifications on the 23S rRNA for cor- 
rect functioning but, once again, not all 
of them are essential [78]. As all genomes 
analyzed contain at least one 23S rRNA 
methylase, the choice was made to in- 
clude CspR, which is widely distributed 
in Gram-positive and Gram-negative bac- 
teria. Huang and coworkers [18] found 
that RlmB, another 23S rRNA methyl- 
transferase, is widely distributed but has 
not been included in the minimal gene-set
machinery because it is not essential in
E. coli [81] and M. genitalium [28], and 
it is absent from reduced endosymbiont 
genomes. 

Several GTPases are involved in ribo-
some assembly [82], four of which (engA,
obg, ychF, and era) were included in the
CMG, although their implication in ri-
bosome maturation and stability was not
clearly defined in 2004 [7]. Currently, it has 
been established that EngA (also known as 
Der) is required for 50S ribosomal subunit 
stability [83, 84], while Obg is involved in 
16S and 23S rRNA maturation [85] and in 
the late steps of 50S ribosomal subunit as- 
sembly [86]. Nevertheless, it appears that 
the main function of Obg is related to chro- 
mosome segregation, thus providing a link 
between translation and the cell cycle [87]. 
Era also appears to couple ribosome as- 
sembly and cell-cycle progression, and to 
be involved in maturation of the 30S sub- 
unit, although its role is still unclear [88]. 
The specific role of YchF remains elusive. 
The gene ybeY was included in the CMG
because it was widely present in bacteria 
with reduced genomes, although its func- 
tion was not known by then [7]. Recently, 
it has been shown that it is important for 
translation and essential under heat shock 
conditions, although its precise molecular 
function remains unknown [89]. Although 
it is not present in the HCG [18], it is es- 
sential in M. genitalium [28], and therefore 
it should be maintained as part of the min- 
imal gene-set machinery for translation, 
though as a putatively nonessential gene.

The translation machinery requires the 
participation of an ensemble of trans- 
lation factors involved in the initiation, 
elongation, translocation, and termination 
phases of the process. infA, infB, and infC 
are required for translation initiation. efp, 
lepA, fusA (EC 3.6.1.48), tsf, and tufA (EC 
3.6.5.3) encode elongation factors that have
been considered essential for a minimal
cell [7]. Although many bacteria possess 
two codon-specific release factors, RF1 and 
RF2, they are encoded by two paralogous 
genes. Therefore, it can be considered that 
a single ancestral release factor would be 
enough to recognize the three stop codons 
in a minimal cell. A modulator of the 
release factor activity, hemK, was also in-
cluded in the CMG because it is present
in most bacterial reduced genomes and 
it is essential in E. coli. The translatome
is completed by the inclusion of tm-
RNA and smpB, the two components of
trans-translation, the process used by bac-
teria to rescue ribosomes stalled on
damaged mRNAs, and to release and tar-
get incompletely synthesized protein
fragments for degradation [90]. 
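
The release-factor argument made earlier in this paragraph can be illustrated with a small coverage check. This is only a sketch: it assumes the well-established codon specificities of the E. coli factors (RF1 reads UAA and UAG, RF2 reads UAA and UGA), and the function name `uncovered_stop_codons` is hypothetical.

```python
# Assumption: canonical stop-codon specificities of the E. coli release factors.
STOP_CODONS = {"UAA", "UAG", "UGA"}
RELEASE_FACTOR_CODONS = {
    "RF1": {"UAA", "UAG"},
    "RF2": {"UAA", "UGA"},
}


def uncovered_stop_codons(factors):
    """Return the stop codons that none of the given release factors reads."""
    covered = set().union(*(RELEASE_FACTOR_CODONS[name] for name in factors))
    return STOP_CODONS - covered


for factors in (["RF1"], ["RF2"], ["RF1", "RF2"]):
    print(factors, sorted(uncovered_stop_codons(factors)))
```

Neither modern factor alone covers all three stop codons, so the single release factor envisaged for a minimal cell would need a broader specificity than either RF1 or RF2, or the genome would have to avoid the unread codon.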

A complete RNA metabolism must in- 
clude RNA degradation as it is an essential 
function of any living cell. However, it 
has not been possible to define a mini- 
mal set of RNases based on experimental 
and computational studies. Only the en- 
donuclease encoded by rnc was widely 
distributed and was included in the CMG, 
given that it is also necessary for rRNA
processing [7]. Nevertheless, considering 
that at least one exoribonuclease must also 
be included, pnp (EC 2.7.7.8) was chosen as 
it encodes a highly conserved protein that 
has been found in most bacterial genomes 
except for Mycoplasma. Studies that select 
mycoplasmas as model organisms to ap- 
proach a minimal genome would choose 
instead rnr, the gene encoding RNase R 
(EC 3.1.-.-) [18, 28]. 


3.1.3 Protein Processing, Folding, and 
Secretion 

Two genes involved in protein post- 
translational modifications were included 
in the CMG because they were present
in all reduced genomes analyzed and
essential in E. coli [7]. map encodes a
widely distributed enzyme that removes 
the N-terminal methionine in nascent pro- 
teins, a critical step in the maturation 
of many proteins. pepA encodes a multi- 
functional DNA-binding aminopeptidase 
which is assumed to play an important role 
in the metabolism of peptides generated 
by protein degradation. 

Proteins must fold correctly to achieve 
functional activity. All living cells possess 
a network of molecular chaperones to help 
in this process and to avoid the danger 
of aberrant folding and aggregation that 
might lead to potentially toxic species [91]. 
Molecular chaperones are quite abundant 
in reduced genomes, where they might 
also be essential to buffer the folding 
defects caused by slightly deleterious mu- 
tations that accumulate due to a lack of 
most repair mechanisms [92]. Some of 
these chaperones, such as the trigger factor 
(TF, encoded by tig), bind in close prox- 
imity to the ribosomal polypeptide exit 
site, and stabilize nascent polypeptides 
while initiating folding; others, like the 
DnaK/DnaJ/GrpE system, are not bound 
to ribosomes but are close to them and 
mediate co- or post-translational folding, 
the translocation of proteins through bi- 
ological membranes, oligomeric protein 
assembly, and their degradation. They also 
distribute subsets of proteins to down- 
stream chaperones, such as GroEL/GroES. 
All of the above-mentioned chaperones 
(except TF) were included in the CMG [7], 
and they have also proven to be essen- 
tial in M. genitalium [28]. tig was found 
in all analyzed genomes by Huang and 
coworkers [18], but it has been lost in 
some endosymbiont genomes (e.g., some 
Buchnera aphidicola strains and Moranella 
endobia) [93-95]. Furthermore, it has been 
proven that E. coli mutants lacking either
tig or dnaK are viable, although they are
synthetic-lethal in some conditions [96].
Therefore, tig has not been included in the 
minimal gene-set machinery. 

Although protein secretion is not con- 
sidered an essential function in a minimal 
bacterial cell, the transport of nascent 
polypeptides for their integration into 
the cytoplasmic membrane also depends 
on the general secretion (Sec) pathway 
[97]. The Sec translocon is a multi- 
meric enzyme with three essential sub- 
units that are preserved even in highly 
reduced genomes: two integral mem- 
brane proteins, SecY and SecE, form the 
preprotein-conducting channel, while the 
dissociable peripheral ATPase SecA hy- 
drolyzes ATP to drive translocation. For 
the cotranslational integration of mem- 
brane proteins, the translocon interacts 
with ribosome-nascent chain complexes. 
A ribosome-bound chaperone, the signal 
recognition particle (SRP, composed of 
4.5S RNA and the protein Ffh), binds to
the signal sequence of nascent proteins tar- 
geted for secretion, to assist them in main- 
taining their translocation-competent state 
and to drive them to FtsY, the SRP recep- 
tor. All of these elements are essential 
components of the minimal translocation 
mechanism. Additionally, for the correct 
functioning of translocated proteins, the 
signal peptide must be removed by a pepti- 
dase. Although no signal peptidase was in- 
cluded in the CMG, lepB, encoding signal 
peptidase I, is present in all endosymbiont 
reduced genomes with complete infor- 
mational machinery and, therefore, has 
also been included in this revised version. 
However, it should be noted that M. geni- 
talium does not possess lepB, and that its 
signal peptidase II, encoded by lsp, appears
to be nonessential [28], and missed the 
threshold cut-off in the studies of Huang 
and collaborators [18]. 


Protein degradation must also be con- 
sidered essential in the minimal cellular 
machinery. It ensures the correct level 
of proteins for cell survival by destroying 
damaged and abnormal proteins, and re- 
moving proteins that are no longer needed, 
also allowing the reuse of the resulting 
amino acids for protein synthesis. The 
CMG included three genes for this pur- 
pose: gcp, hflB, and lon [7]. All three genes 
are essential in M. genitalium [28] and 
are present in the HCG [18]. hflB and 
lon encode two ATP-dependent proteases. 
HflB (EC 3.4.24.-) has been implicated in
the degradation of some integral mem- 
brane proteins, while the protease La 
(EC 3.4.21.53), encoded by lon, degrades 
short-lived regulatory and abnormal pro- 
teins. The role of Gcp was not fully known 
in 2004, but it is a ubiquitous enzyme that 
was annotated as an O-sialoglycoprotein 
endopeptidase, and was assumed to be 
a glycoprotein-related chaperone protease. 
Recent studies have shown that Gcp pre- 
vents the accumulation of highly stable 
toxic compounds derived from nonenzy- 
matically glycated proteins [98]. 


3.2 
The Second Pillar. Energetic and 
Intermediary Metabolism 


The second central pillar of any living 
system is metabolism. A minimal cell 
would require a minimal set of anabolic 
pathways to transform and assemble its 
biomolecule building blocks by using the 
energy and nutrients provided by the envi- 
ronment, in order to achieve metabolic 
homeostasis, and to accomplish cellu- 
lar growth and reproduction (Table 7). 
Metabolism is highly dependent on the 
repertoire of nutrients present in the en-
vironment. The simplest cell should be
chemoorganoheterotrophic (i.e., an organ-
ism using organic compounds as both
carbon and energy sources), living in a 
nutrient-rich medium, in which the major 
metabolites (glucose, fatty acids, nitroge- 
nous bases, amino acids, and vitamins) 
would be available without limitation, thus 
making unnecessary the de novo biosynthe- 
ses of these basic components. Neverthe- 
less, considering the versatility of bacterial 
heterotrophic metabolism, many different 
alternative metabolic charts can be pro- 
posed. The alternative proposed by the 
present author’s group in 2004 [7] was 
based on those metabolic functions that 
were preserved in highly reduced genomes 
completely sequenced at that time. The 
CMG encoded the least costly pathways that
would allow the cell to perform the se- 
lected metabolic functions. In order to 
maintain a coherent metabolic function- 
ality, also included in the minimal set 
were some pathways that were not present 
in some host-dependent bacteria, because 
their lack reflected a high dependence on
their hosts (Fig. 2). 

The coherence of the minimal meta- 
bolism coded by the proposed gene 
repertoire was further explored [99], 
demonstrating the stoichiometric consis- 
tency of the minimal metabolic network. 
During the fine reconstruction of the min- 
imal metabolic network, we realized that 
a sedoheptulose-1,7-bisphosphatase (EC 
3.1.3.37) should be added to the mini- 
mal genome. Another enzyme included 
in the metabolic network reconstruction 
was cytidylate kinase (Cmk, EC 2.7.4.14), 
involved in the conversion of CMP to CDP.
However, this enzyme was not included 
in the CMG because genomic analyses 
of several bacteria with reduced genomes 
revealed that they possess only one pyrim-
idine diphosphate kinase; hence, tmk was
chosen for this purpose as it was the only
one present in all proteobacteria with re- 
duced genomes analyzed [7]. In fact, more 
recent studies on bacterial endosymbionts 
of insects have revealed that the minimal 
machinery for the biosynthesis of pyrimi- 
dine nucleotides can be further simplified. 
B. aphidicola BCc (an endosymbiont of 
cedar aphids) [93] and Moranella endobia 
(an endosymbiont of mealybugs) [95] have 
only preserved one diphosphate kinase, 
adenylate kinase (EC 2.7.4.3). Additionally, 
they lack ndk, the gene encoding the nu- 
cleoside diphosphate kinase (EC 2.7.4.6) 
included in the CMG because the pyruvate 
kinase (EC 2.7.1.40), encoded by pykA and 
involved in glycolysis, is able to phospho- 
rylate the purine nucleoside diphosphates 
but not the pyrimidine diphosphates. Con- 
sidering these new findings, it would be 
possible to eliminate tmk, gmk, and ndk 
from the minimal gene-set machinery, 
provided that modified versions of adk 
and pykA, which would encode diphos- 
phate and triphosphate kinases with a 
broader specificity, would be the essential 
genes for the phosphorylation of all nu- 
cleotide monophosphates. Therefore, tmk, 
gmk, and ndk have been labeled as puta- 
tive nonessential genes in this reviewed 
version of the CMG. 

The validity of the minimal network 
proposed here is also highlighted by its 
comparison with the metabolic network of 
M. pneumoniae, the pathogenic bacterium
used as a model by Yus and coworkers 
[100] in their attempt to understand the 
basic principles of bacterial metabolism 
organization and regulation in a bacterium 
with a reduced genome. Based on the 
metabolic network deduced from the M. 
pneumoniae genome, composed of 189 re-
actions catalyzed by 129 enzymes, Yus
and colleagues experimentally designed 
a defined, minimal medium capable of
supporting bacterial growth that contained
only 26 organic components, 19 of which 
were essential nutrients. The metabolic 
network deduced from the CMG is com- 
posed of only 50 reactions, but the nutrient 
requirements are quite similar (21 of the 
26 components proposed by Yus et al.). 
Four of the nutrients that are not needed 
to feed the present network are related 
to requirements specific to mycoplasmas,
due to the special nature of their cel- 
lular envelope (cholesterol, choline, and 
spermine) or for gene regulation purposes 
(glycerol). Finally, lipoic acid is not needed 
in the present minimal genome because 
no enzyme using it as a cofactor has been 
included. 
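
The nutrient comparison in this paragraph reduces to simple bookkeeping, sketched below in Python. The counts and the five named components come from the text; the dictionary layout and variable names are hypothetical, and the remaining 21 shared components are not itemized here because the text does not list them.

```python
# Number of organic components in the defined medium of Yus et al. [100].
YUS_MEDIUM_COMPONENTS = 26

# Components not required by the CMG-derived network, with the reason
# given in the text for each.
not_needed = {
    "cholesterol": "mycoplasma cell envelope",
    "choline": "mycoplasma cell envelope",
    "spermine": "mycoplasma cell envelope",
    "glycerol": "mycoplasma gene regulation",
    "lipoic acid": "no enzyme in the network uses it as a cofactor",
}

# Components shared by both nutrient lists.
shared_components = YUS_MEDIUM_COMPONENTS - len(not_needed)

mycoplasma_specific = [name for name, reason in not_needed.items()
                       if reason.startswith("mycoplasma")]

print(shared_components, len(mycoplasma_specific))
```

The four mycoplasma-specific nutrients plus lipoic acid account for exactly the five components by which the two lists differ.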

Although ATP synthase was included 
in this category in the CMG [7], it was 
not proposed to serve as a source of ATP, 
but rather as a generator of the
proton-motive force needed
for physiologically normal function of the 
cell membrane. Therefore, it has been 
moved to the cell envelope category in 
this reviewed version. 


3.3 
The Third Pillar. The Cell Envelope and Its 
Involvement in Essential Cellular Processes 


In the CMG it was assumed that, in a 
protected environment, the cell wall might 
not be necessary for cellular structure and, 
therefore, it was not considered as an 
essential part of the minimal gene-set ma- 
chinery [7]. Thus, the proposed minimal 
cell would have its genetic and metabolic 
machinery enclosed inside a cell mem-
brane consisting of a lipid bilayer of
phosphatidylethanolamine (the
main phospholipid present in bacterial 
membranes), into which membrane pro-
teins would be inserted (Table 4). These
would include the ATPase, the Sec translo-
con, and transporters.

A bacterial cell possessing a min- 
imal heterotrophic metabolism will 
be unable to synthesize many essential 
metabolites, and it must rely on the 
environment to obtain them. Several 
low-affinity active transporters, a 
phosphoenolpyruvate-dependent sugar 
phosphotransferase system (PTS), and a 
few ATP-binding cassette (ABC) trans- 
porters are present in all analyzed bacteria, 
although no single transport system 
is universally present. Nevertheless, 
given that the proposed minimal
cell also has a simplified cell envelope,
it must be more permeable to small 
metabolites, being close to what has 
been called a “free-diffusing cell” [101]. 
For this reason, only four genes involved 
in substrate transport were included in 
the CMG [7], to ensure the entrance 
of carbohydrates and phosphate to fuel 
the metabolic reactions, and because 
they were widely distributed: a
glucose-specific PTS, encoded by the genes
ptsG, ptsH, and ptsI; and a low-affinity
inorganic phosphate transporter PitA. 
Additionally, FadD, which is involved in 
the first step of phospholipid biosynthesis, 
catalyzes the esterification of fatty acids 
into metabolically active CoA thioesters, 
concomitant with their transport. Finally, 
ATP synthase, which is preserved even
in highly reduced genomes that have lost 
the electron transport chain, was proposed 
to function as a proton pump, to generate 
and maintain a negative transmembrane 
potential, which is necessary for the 
correct functioning of the cell membrane 
[7]. The MCM model proposed by 
Shuler and coworkers includes 19 more 
genes devoted to substrate transport, to 
ensure the correct delivery of substrates 
for sustaining growth, and to prevent
lactate accumulation in the cell [47].
Nevertheless, transporters are almost 
absent in endosymbiotic bacteria with 
highly reduced genomes and simplified 
cell envelopes, and the decision was 
taken to exclude them from the minimal 
gene-set machinery. It would be necessary 
to perform experimental studies to 
determine whether they are needed, or 
are superfluous, for a minimal cell living 
in a stable and rich environment. 

The last essential cellular process that
involves the cell envelope is cell division. In
the CMG it was proposed that ftsZ alone
can be sufficient for this purpose, as it can 
provide the constrictive force necessary 
to split the cells. In fact, it is present 
virtually in all species of bacteria and 
archaea whose genomes are available with 
only a few exceptions [102], it is the only 
gene involved in cell shape and division 
that is present in all bacteria analyzed by 
Huang and coworkers [18], and it was 
also considered essential in M. genitalium 
[28]. Recent studies revealed that ftsZ can
be deleted from M. genitalium [103], but
the mutant cells rely on adhesins, proteins
related to virulence that are not included
in the proposed minimal machinery, for 
an alternative mechanism of cell division. 


4 
Conclusions 


During recent decades, research groups 
worldwide have devoted much time and 
effort in attempts to define a minimal 
genome, because knowledge of the essen- 
tial genes for life can provide at least three 
benefits: 


• It will improve the present understand-
  ing of the principles of life, as simpler
  cells would be more easily studied.

• Identifying essential genes can allow
  the identification of targets to design
  new drugs for combating emergent
  pathogens [104].

• It provides a fundamental step to the
  design of a minimal genome that could
  be used as a “chassis” in synthetic
  biology approaches [105].


The main question that remains is 
whether it is possible to sharply define 
a minimal genome. The more that be-
comes known about life, the more it is
realized that much is still unknown! Genes
that have been included in different min- 
imal genome versions vary enormously 
depending on the cellular type and its 
environment; thus, the most that can be 
done is to outline the core functions which 
must be fulfilled to sustain life [106]. 
Nonetheless, information will continue to 
be accumulated in the near future so that 
the achievement of a minimal functional 
system will become feasible [11]. Cur- 
rently, cellular organization as a whole 
is not completely understood, and several 
levels of complexity exist (transcriptome, 
proteome, metabolome, interactome) that
have not been taken into account when 
attempting to define a minimal gene-set 
machinery [44]. However, the advent of 
high-throughput sequencing and mea- 
surement techniques has accelerated the 
characterization of some model organisms 
to the extent that comprehensive model- 
ing is now possible. This approach has 
already been applied to Mycoplasma, a bac- 
terial genus with relatively small genomes 
that has undergone extensive analysis, 
and a detailed examination of the pro- 
teome, transcriptome and metabolome of 
M. pneumoniae has been achieved in re- 
cent years [4, 100, 107]. Unfortunately, 
however, the results of these studies have 
revealed the situation to be much more
complicated than expected, even for a bac-
terium with a naturally reduced genome.
Almost a decade ago, by employing a 
conjoint analysis of computational and 
experimental approaches, the existence of 
the CMG (a set of 206 protein-coding
genes that appeared essential to sustain
a minimal cell) was proposed. In this
chapter, an attempt has been made to 
refine this suggestion, based on current 
knowledge and the addition of RNA genes 
to the proposed gene-set. The minimal 
gene-set machinery presented here is 
composed of 187 to 205 protein-coding 
genes and 35 to 38 RNA genes, and is 
sustained by three main “pillars”:


1. The genetic machinery, composed of
virtually complete DNA replication
and transcription machineries, complex
translation machinery and a simple
DNA repair system.

2. The energetic and intermediary
metabolism, in which energy is
obtained via substrate-level phospho-
rylation, while the basic elements
(glucose, phosphate, fatty acids,
nitrogenous bases, amino acids,
vitamins, and inorganic ions) are
provided by the environment for the
synthesis of essential cell components.

3. The cell envelope that encloses the
genetic and metabolic machineries,
gives shape to the cell, is in charge of
the interaction with the environment,
and grows and divides to allow the
formation of daughter cells.
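
The range of 35 to 38 RNA genes quoted above can be reconstructed from the RNA components named in this chapter. The Python sketch below is one plausible tally, not the authors' own itemization: the grouping into rRNAs, tRNAs, and three other RNAs is an inference from the text, and the variable names are hypothetical.

```python
# Assumption: itemization inferred from the RNA components named in the text.
RRNAS = ["16S rRNA", "23S rRNA", "5S rRNA"]
OTHER_RNAS = [
    "RnpB RNA (catalytic subunit of RNase P)",
    "tmRNA (trans-translation)",
    "4.5S RNA (signal recognition particle)",
]
TRNA_MIN, TRNA_MAX = 29, 32  # conserved set vs. Forster and Church's list [74]

rna_min = len(RRNAS) + len(OTHER_RNAS) + TRNA_MIN
rna_max = len(RRNAS) + len(OTHER_RNAS) + TRNA_MAX

# Protein-coding range as stated in the text.
PROTEIN_MIN, PROTEIN_MAX = 187, 205

total_min = PROTEIN_MIN + rna_min
total_max = PROTEIN_MAX + rna_max

print(rna_min, rna_max, total_min, total_max)
```

Under this reading, the whole spread in the RNA count comes from the three tRNAs labeled as putatively nonessential, and the complete gene-set would comprise between 222 and 243 genes.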


Only a few changes have been made 
in the list of protein-coding genes with 
regard to the previous CMG [7]. In this
refined version, no poorly characterized 
genes are included, but among the eight 
genes in the former category four have
been reassigned because they have al-
ready been associated with specific cellular
functions in translation (mraW, rsmI, tilS,
and ybeY). The other four genes (ycfF,
ycfH, yoaE, and yqgF) have been removed
because, although they were present in 
all reduced genomes used in the 2004 
comparison, they do not have a currently 
defined function assigned to them. Al- 
though all of these genes (except yqgF) 
appear to be essential in M. genitalium 
[28], none of them has been included in 
the HCG [18]. In addition to the essential 
RNA genes, several protein-coding genes 
have been added. These include trmD, 
which is involved in tRNA methylation, 
and lepB, which encodes a signal pepti- 
dase. dnaA has been included as optional 
because, based on current knowledge, it 
is not clear if it would be needed in a 
minimal cell machinery. Some genes al- 
ready included in the CMG might not be 
absolutely essential, such as holA (encod-
ing the subunit δ of DNA polymerase),
up to 11 ribosomal proteins, and ksgA (in- 
volved in ribosome maturation). Finally, 
it must be taken into account that the 
maturation of ribosomes and tRNAs in- 
volves quite complex processes that are 
not fully understood, and essential genes 
might be missing to accomplish the as- 
sembly of a functional translatome. In any 
case, it must be stated that the definition 
of a minimal gene-set machinery remains 
a highly speculative exercise because, with 
the available data, it is impossible to de- 
termine if the specified gene components 
might be sufficient to maintain cell life. 
The results of recent studies have indi- 
cated that the addition of some genes not 
present in the minimal list might improve 
cell efficiency [74, 108], and for that reason 
the decision was taken to include on the 
proposed list some putatively nonessen- 
tial genes that might help in terms of cell 
performance. Experimental studies of purified 
and crude subsystems will surely 
help in the determination of their essen- 
tiality, as well as in the identification of 
some additional components. 

Finally, it should be noted that a defi- 
nition of the minimal gene-set machinery 
will not lead to a full definition of a min- 
imal cell; rather, this goal will not be 
achieved until some additional aspects of 
the genome and cellular organization are 
considered. At present, information is still 
lacking regarding the three-dimensional 
properties of the bacterial chromosome 
[109], the regulatory mechanisms needed 
for correct and coordinated gene expres- 
sion in a minimal living system [66], and 
the complex patterns of protein-protein 
interactions, all of which generate a com- 
plex spatial and functional network that 
most likely holds all of the cellular con- 
tents connected to the cell membrane [44]. 
At present, it is difficult to predict how long 
it will take to untangle all of these addi- 
tional layers of complexity in a modern cell. 
Nonetheless, these challenges will need to 
be addressed before a minimal cell can be 
designed and, in a more biotechnological 
vein, engineered for specific applications 
[110]. The era of synthetic biology has just 
begun. 
Keywords 


Molecular cloning 
A set of experimental methods in molecular biology that are used to construct 
recombinant DNAs and to introduce them into host organisms. 


Metabolic engineering 
The optimization of genetic regulatory processes within cells to increase their production 
of a certain substance. 


Gene assembly 
Process to connect multiple genes or DNA fragments from various origins. Sequence 
information for the final assembled structure is a prerequisite. 


BGM vector 

The Bacillus subtilis Gene/Genome Manipulation (BGM) vector, derived from the 4200-kbp 
genome of B. subtilis 168, has been demonstrated to accommodate fairly large DNAs. 
The vector is highlighted by the successful stable cloning of the whole 3500-kbp genome 
of the nonpathogenic, unicellular photosynthetic bacterium Synechocystis, and of other 
DNAs of known sequence. 


OGAB method 

A novel DNA fragment assembly method, named after Ordered Gene Assembly in 
Bsu168 (OGAB), that produces a DNA in plasmid form via Bsu168 transformation. The 
method offers outstanding breakthroughs that are applicable to creating large DNA cassettes 
with many relevant genes. 


Long-term DNA reservoir 

A giant DNA constructed by gene assembly. These are highly valuable and desirable 
for future studies, and should be preserved without damage. The preservation of DNA 
sequence information, as well as DNA resources if they cover a DNA library for various 
species, is very important for not only DNA editing but also genome synthesis in the 
future. 
Mammalian cells possess two distinct types of DNA: the mitochondrial genome 
and the nuclear chromosome. The former may be only 16.5 kbp in size. Although 
the mammalian mitochondrial genome is far smaller than the DNA cloned using 
Escherichia coli during the early 1990s, cloning of the first whole mammalian 
mitochondrial genome was not accomplished until 2005. The nuclear chromosome, 
which is too large to be cloned by any method, has been segmented into an 
Escherichia coli BAC vector, and is now available as a mouse BAC DNA library. 
For sophisticated genetic research, large, complicated and highly engineered 
DNA is required. However, current protocols rely too heavily on the E. coli 
molecular cloning system, and cannot meet all present research demands. In 
this chapter, an alternative novel cloning system is reviewed, termed Bacillus 
genome manipulation (BGM). The BGM system is not merely an auxiliary method 
but rather a crucial advance that satisfies nearly all requirements for a wide 
range of global applications. Attention will be focused on three BGM topics 
relevant to mouse genetics and genomics, with reference to recent studies. 
These topics include: cloning full-size mouse mitochondrial genomes; giant 
mouse chromosomal segments deliberately engineered for transgenic mice; and 
the long-term preservation of valuable DNA. The latter topic, though rarely 
discussed, is particularly important for controlling the costs of future uses of 
large DNA. 


1 
Introduction 


The molecular cloning of DNA is one of 
the fundamental technologies of modern 
biology. Over the past decade, advances in 
DNA cloning technology have ushered in a 
new era in which the use of giant DNA has 
become more prevalent [1-8]. The recent 
burst of genome sequence information 
has permitted innovative technologies to 
be developed in which affordable genes are 
not limited to existing genes but have been 
expanded to include de novo-designed and 
synthesized genes. Increasing the access 
to synthetic DNA [6, 7], which in turn al- 
lows the genetic engineering of model mi- 
crobes, animals and plants, should deepen 
the current basic knowledge of the life 
sciences and also contribute to practical 
applications. Today, reverse genetics for 


model animals demands larger and more 
complex size variations, due to increases 
in the number of regulation factors to be 
examined [9]. The most important animal 
model, the mouse Mus musculus, carries two 
distinct genomes in the manner of Homo 
sapiens: one in the nucleus consisting of 
17 chromosomes, and the other in the 
organelle mitochondria [10]. The intro- 
duction of designed mutations requires 
various levels of designed recombinant 
DNA that are supplied mostly through 
a conventional Escherichia coli K-12-based 
molecular cloning system [11]. However, 
the DNA products that can be produced 
using the current E. coli system cannot 
readily meet the growing demands, and the 
system also has certain restrictions that 
will be referred to in this chapter. As an 
alternative, a novel cloning system called 
Bacillus genome manipulation (BGM) will 
be introduced in this chapter. It is important 
to note that the BGM system is not 
merely an auxiliary method but rather a 
crucial advance that satisfies nearly all of 
the requirements for a wide range of global 
applications (see Sections 2-4). Four topics 
concerning types of DNA molecule are covered: 


• Sequences that are difficult to clone 
and maintain using E. coli, such 
as whole mitochondrial genomes (mtDNAs) 
[10], which initially proved to be 
obstinate when cloned in E. coli, although 
the procedure was later improved (see Sect. 3.2). No 
apparent difficulties were encountered 
when cloning these mtDNAs using the 
BGM system [1, 12], or when using 
Saccharomyces cerevisiae [13]. 

• A system which permitted an effective 
and precise assembly of multiple 
DNA fragments that originally were dis- 
persed at different chromosome loci. 
The assembly of multiple DNA frag- 
ments using the BGM system yields 
complicated DNA sequences, much like 
those required for purpose-oriented ge- 
netics (see Sect. 3). 

• Large DNAs produced by the system, 
which cover most genes of the mouse 


genome, including a number of introns, 
and regulatory cis and trans sequences. 
In some cases, the DNA size may 
eventually exceed the potential size 
limits (estimated as 350-400 kbp) by 
E. coli bacterial artificial chromosome 
(BAC) vectors [3, 5]. Some representa- 
tive examples of large DNAs produced 
by the BGM system, with little difficulty, 
are described in Section 4. 

• Engineered BAC DNAs, which are valuable 
but difficult to reproduce. The 
use of B. subtilis allows the long-term 
preservation of such DNAs at low cost 
(see Sect. 4.5). 


2 
Giant DNA Handled by the BGM System 


2.1 
Differential Principles of DNA Uptake 
between E. coli and B. subtilis 


The cloning of DNA is generally carried 
out using plasmid vectors, which replicate and 
are maintained independently from the 
host genome. In a typical cloning process 
in E. coli, the DNA fragments are prepared 
Fig. 1 (a) Different molecular mechanisms 
for DNA uptake through a membrane. The 
green and black ovals inside rectangular cell 
walls represent the genome of B. subtilis and 
E. coli, respectively. Plasmids are drawn as cir- 
cles carrying a replication initiation site (small 
open oval). In B. subtilis, the protein com- 
plexes formed in the membrane bind dsDNA 
non-sequence-specifically and are processed 
to deliver linear single-stranded DNA (ss- 
DNA) into the cytoplasm through the B. sub- 
tilis membrane, as described in the text. Ge- 
netic pathways to yield dsDNA as a plasmid 
form or integrated form are shown in Figs 1b 
and 2a, respectively. A technically important 
advantage is that the plasmid does not need 
to be circularized prior to transformation. In 


E. coli, DNA delivery should be performed 
via chemical treatment to loosen the cellular 
membranes for the uptake of DNA from the 
outside, because genetically coordinated active 
molecular mechanisms for DNA delivery 
are totally absent. Plasmid DNA delivered to E. 
coli must be in a circular dsDNA form to start 
replication [14-16]; (b) Transferred DNA to 
replicate as plasmid. The plasmid unit (bold 
line or circle) is illustrated by a replication ini- 
tiation site (open circle), sandwiched by AB 
and CD sequences. DNA with a tandem repeat 
of the plasmid unit (left) can be processed to 
form a circular plasmid. DNA of only 
unit length (right) is too short to be 
repaired, so transformation is abortive. 
by either physical shearing, endonuclease 
digestion or PCR-mediated amplification, 
and are connected enzymatically by DNA 
ligase to the plasmid vectors in a test tube. 
As indicated in Fig. 1a, DNA cloning by 
E. coli has an absolute requirement for 
vector plasmids [14-16]. In addition, the 
target DNA and vector plasmid should be 
combined and circularized in vitro prior to 
transformation, in order for them to start 
replication. The absolute requirement for 
a circular form is strongly associated with 
the manner of physical delivery of DNA 
into the E. coli cytoplasm. 

In sharp contrast, the whole process of 
DNA delivery into the B. subtilis cytoplasm 
is regulated in a genetic manner; that 
is, it depends on the natural competence 
developed by an inherent set of 
competence genes of B. subtilis [17-19]. 
Those genes encoding DNA uptake and 
processing systems are often functionally 
similar to subunits of the type IV pili and 
type II secretion systems [20]. Details of 
natural competence and DNA uptake are 
left to the cited articles [17-20]; the 
question here is why B. subtilis has been 
employed as a host for DNA cloning or synthesis. 
Suffice it to mention that the molecular 
mechanism for genetic transformation of 
B. subtilis yields DNA in two different 
forms, namely a plasmid formed by 
recircularization (see Sect. 2.2) and DNA integrated into 
the genome (see Sect. 2.3). As a result, the 
BGM system permits the consistent production 
of DNAs that would not be achievable 
by E. coli (see Sections 3 and 4). 


2.2 
DNA Cloned in the Plasmid Form in 
B. subtilis 


A plasmid, a circular double-stranded DNA 
(dsDNA), requires a replication origin 
sequence (ori) and accessory proteins to 
support maintenance. Distinct from E. coli 
transformation, plasmid DNA delivered 
to B. subtilis via genetically regulated 
transformation requires a structure of 
multiple unit lengths tandemly repeated, 
as depicted in Fig. 1b, regardless of whether 
it is circular or linear. This simple 
requirement is well matched to the high yield of linear 
DNA produced by connecting multiple 
fragments via an in-vitro ligation reaction, 
where circularization rarely takes place [7, 
8, 21, 22]. The assembly of a number of 
genes in one plasmid by the BGM system 
is an indispensable tool for various approaches, 
and in particular for synthetic biology 
or metabolic engineering at present. It 
should also be applicable to the design and 
production of engineered mtDNA and 
accessory sequences for research using 
mice that is currently under way (M. Itaya, 
S. Kaneka, unpublished data). 
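
The tandem-repeat requirement of Fig. 1b can be illustrated with a toy model. The sketch below is not a biological simulation: it simply assumes that a linear DNA can be resolved into a circular plasmid only when some stretch of it recurs exactly one unit length further along, so that the two copies could recombine to close the circle. The function name, the min_homology threshold and the letter-coded sequences are hypothetical and chosen only for illustration.

```python
def can_recircularize(linear_dna: str, unit_len: int, min_homology: int = 4) -> bool:
    """Toy check: does the linear molecule carry a direct repeat spaced one unit apart?"""
    # A molecule of only one unit length has no second copy to recombine with.
    if len(linear_dna) < unit_len + min_homology:
        return False
    # Look for any window that recurs exactly unit_len positions downstream.
    for start in range(len(linear_dna) - unit_len - min_homology + 1):
        window = linear_dna[start:start + min_homology]
        repeat = linear_dna[start + unit_len:start + unit_len + min_homology]
        if window == repeat:
            return True
    return False


unit = "ABCDORIEFGH"    # hypothetical plasmid unit; letters stand in for sequence blocks
monomer = unit          # unit-length linear DNA (Fig. 1b, right)
concatemer = unit * 2   # tandem repeat of the unit (Fig. 1b, left)

print(can_recircularize(monomer, len(unit)))     # False
print(can_recircularize(concatemer, len(unit)))  # True
```

In this simplified picture a ligation product needs only slightly more than one unit length, provided that the extra stretch repeats the start of the unit.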


2.3 
DNA Cloned in the B. subtilis Genome 


DNA molecules unable to replicate in B. 
subtilis are processed in transformation in 
one of two ways: (i) they are digested in 
the cytoplasm (Fig. 1a); or (ii) they are 
integrated into the genome where homol- 
ogous sequences are present (Fig. 2a) [1, 
2, 5, 7, 8, 22]. Accessory sequences are 
used for selection in the cloning processes 
illustrated. 

In order to extend general cloning 
methods by exploiting the process of 
integration into the B. subtilis genome, nu- 
merous research groups have attempted 
to use plasmid pBR322, a general cloning 
plasmid for E. coli [1, 2, 23, 24, 27, 25, 28]. 
DNA cloning using the pBR322 vector 
is simple but lacks selection markers for 
B. subtilis, and consequently derivative 
plasmids possessing antibiotic resistance 
marker genes for B. subtilis have been 
constructed to stimulate and simplify the 
transfer procedure. DNA cloned into the 
E. coli plasmid vector is immediately trans- 
ferred in the BGM system, while DNA 
cloned in pBR322 via the E. coli molecular 
cloning system immediately becomes 
guest DNA in the pBR322-based BGM. 


2.4 
Assembly of Contiguous DNAs in the 
B. subtilis Genome: the Domino Method 


The connection of two adjacent DNAs, 
separately cloned in the E. coli pBR322 
plasmid and transferable into the B. 
subtilis genome, worked out very well 
(as expected), and was extended to the 
connection of more than three DNA 
fragments by the alternate use of selection 
markers (see Fig. 2b). The domino method 
simply requires a full set of domino clones 
that covers the entire target genome. 
Gaps due to a lack of available domino 
clones are resolved when a PCR primer 
is appropriately designed. Since the first 
complete genome cloning, which was 
achieved as early as 1995 from E. coli 
bacteriophage lambda (48.5kbp, a very 
large DNA at that time [24]), various 
large DNA loci of mouse chromosomes 
[29-31] or mtDNA [1] have been reported. 
The domino method has several distinct 
advantages: 


• It does not require highly purified large 
DNA molecules. 

• The recombinant genome structure can 
be freely designed by choosing the first 
and the last dominos. 

• The cloned DNA remains structurally 
very stable because there is only a single 
copy in the B. subtilis genome. 

• The size of the domino clone can 
vary, depending on the design and 
preparation method. 


It seems noteworthy here that dominos 
prepared in BAC vectors also worked out 
as expected (see Sections 3.3 and 4). 
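
The logic of domino elongation can be sketched in a few lines of code. This is a schematic model only, assuming that each domino overlaps the end of the growing insert and that two selection markers are used alternately; the marker names (cat, erm), the minimum overlap and the short test sequence are illustrative assumptions, not parameters of the published protocol.

```python
def assemble_dominos(dominos, min_overlap=3, markers=("cat", "erm")):
    """Join overlapping fragments in order, alternating the selection marker at each step."""
    assembled = dominos[0]
    steps = [(1, markers[0], len(assembled))]
    for i, domino in enumerate(dominos[1:], start=2):
        # Find the longest suffix of the growing insert that is a prefix of the next domino.
        overlap = 0
        for k in range(min(len(assembled), len(domino)), min_overlap - 1, -1):
            if assembled.endswith(domino[:k]):
                overlap = k
                break
        if overlap == 0:
            # Corresponds to a gap in the domino set.
            raise ValueError(f"domino {i} does not overlap the growing insert")
        assembled += domino[overlap:]
        steps.append((i, markers[(i - 1) % len(markers)], len(assembled)))
    return assembled, steps


target = "ATGGCTAGCTTAGGCCTAAGCTTGCATGCCTGCAGG"      # hypothetical target sequence
dominos = [target[0:14], target[10:26], target[22:36]]  # three dominos with 4-base overlaps

assembled, steps = assemble_dominos(dominos)
print(assembled == target)  # True
for number, marker, length in steps:
    print(number, marker, length)
```

A missing domino shows up as a ValueError, which mirrors the gaps that the text says are resolved by designing an appropriate PCR primer.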


3 
Production of a Full-Length Mouse 
Mitochondrial Genome by the BGM System 


Mitochondrial DNA genomes (mtDNAs), 
which are present in eukaryotic cells, vary 
in size from about 16 to 17kb in mam- 
mals. The typical animal cell contains 
hundreds of mitochondria, which produce 
the majority of the cell’s ATP through 
oxidative phosphorylation [10, 32, 33]. Hu- 
man mitochondria appear to play central 
roles in many fundamental processes of 
life, not only in the cells’ metabolic activity 
to produce ATP but also in cell 
signaling, fertilization, development, dif- 
ferentiation, aging, apoptosis, and gen- 
der determination [33]. The cloning of 
full-length mtDNA would provide a re- 
source for various molecular biological 
research schemes, such as attempts to 
create recombinant mtDNA haplotypes in 
mammalian cells [34-36], including at- 
tempts to replace cellular mtDNA [37, 38] 
and studies using DNA replication [39, 40]. 
Until 2003, however, the cloning of 
mammalian mtDNA had remained problematic 
despite many attempts, which 
might be attributed to the E. coli vector 
system [10, 34]. 


3.1 
Cloning and Engineering of mtDNA in 
E. coli 


The cloning of mtDNA from mice (16.3 kb) 
was first reported in a pioneering study 
performed by Yoon and Koob [10], who 
showed that a full-length mtDNA in plas- 
mid form could be produced in E. coli by 
direct selection of circular mtDNA that had 
been carefully isolated from mouse liver, 
and into which an E. coli cloning vector had 
been randomly inserted in an in-vitro trans- 
position reaction. Three types of clone 
were obtained in this way, one each in 
which the E. coli vector was inserted into 
the ND1, COXII, and ND5 genes, respectively. 
According to these authors, the recombi- 
nant mtDNA was amplified in E. coli and 
then transferred back into the mtDNA-less 
ρ0 mouse cell line LL/2 for subsequent 
analyses [10]. The success of the proce- 
dure depended on the careful prepara- 
tion of unsheared, intact circular mtDNA, 
and was technologically fortuitous because 
the site at which the mtDNA was inter- 
rupted by the cloning vector was found 
to depend on the transposon-mediated 
insertion process, which was difficult to 
regulate. Later, the same group established 
a more general protocol in which 
PCR-amplified segments were connected 
step-by-step in a sequential manner to 
complete the mouse mtDNA, with no sequence 
misincorporations [34, 35]. 


3.2 
Direct Cloning by Use of Purified mtDNA 
by BGM 


An alternative protocol for the stable cloning 
of complete mouse mtDNA, which consisted of 
direct cloning into the B. subtilis genome, 
was first developed in 2007. This process 
also required highly purified mtDNA from 
mouse liver [12]. In this case, two PCR 
DNA segments of 2.06 and 2.14 kb that 
flank the internal 12 kb were subcloned 
into an E. coli pBR322 plasmid. Subsequent 
integration of the plasmid at the 
cloning locus of the BGM vector shown 
in Fig. 2a yielded a recipient B. subtilis 
for the target integration of the internal 

Fig. 2. (a) General cloning via the pBR322 
sequence. Double-stranded DNAs (dsDNAs) 
presented outside of competent B. subtilis cells 
are converted to single-stranded DNAs (ss- 
DNAs) during the DNA-uptake process. The 
resultant ssDNA shown in Fig. 1 is highly re- 
combinogenic and is immediately recruited 
by cellular RecA protein, which guides it into the 
recA-dependent homologous recombination 
pathway. The DNA promptly recombines 

with the counter homologous sequences (if 
present) in the genome. The inherent na- 
ture to recombine between two homologous 
sequences, indicated by X, permits and fa- 
cilitates the integration of DNA at the target 
pBR322 site installed previously [23-25]. BACs 
are handled by BAC vector sequences 
that replace the pBR322 [26]. The 
open and closed arrows indicate pBR322 or 
BAC vector sequences with replication initia- 
tion site (small open oval). B. subtilis strains 
carrying these sequences, called Landing Pad 
Sequences (LPS) in the genome, are generally 
named BGM vectors. pBR322 sequences in 


the BGM vector are specified as GpBR. The 
red lines indicate target DNA in the BGM system; 
(b) Domino cloning. All domino DNAs 
(red lines) share two common structural 
features: possession of a sequence overlap 
with the adjacent domino insert; and two 
identical regions of the pBR322 sequence. The 
two pBR halves (the amp and tet halves), 
illustrated as open and closed 
arrows in Fig. 2a, play essential roles 
in BGM cloning at every stage. The 
same two pBR halves were integrated ear- 

lier in particular loci of the B. subtilis genome 
[1, 23]. This genome-integrated pBR form, 
termed GpBR, was proven to accommodate 
DNA sandwiched by the two pBR halves used 
as LPS. The combination of the pBR part of 
domino clones and the GpBR assures that 
DNA integrated into the B. subtilis genome 
always remains flanked by the amp and tet 
halves of GpBR. The first domino integration 
is similar to that in Fig. 2a. Antibiotic selec- 
tion markers (small open and closed circles) 
are inserted at the end of the target DNA. 
12 kb mtDNA region [12]. The recipient 
cell took up mtDNA that had been puri- 
fied from the mouse liver, and integrated it 
by homologous recombination at the two 
preinstalled mtDNA-flanking sequences. 
The complete cloned mtDNA in the BGM 
vector was then converted to a covalently 
closed circular (ccc) plasmid form via gene 
conversion in B. subtilis (see Sect. 4.2.3). 
The final mtDNA structures, shown 
in Fig. 3b, are plasmids capable of 
propagation in both B. subtilis and E. coli [1]. 


3.3 
Synthesis of mtDNA by the Domino 
Method 


Domino clones for mtDNA reconstruction 
were prepared using PCR products ob- 
tained from genomic DNA purified from 
BALB/c mouse kidney, ranging in length 
from 4 to 6 kb. The four dominos illustrated 
in Fig. 3a were assembled at a BGM 
cloning locus where the pBR322 sequence 
had been installed [1]. In this way, re- 
combinant mtGI [1-2-3-4] was obtained 
by placing domino 1 in the first posi- 
tion, followed by dominos 2 to 4. Three 


other permutations of the order of the four 
dominos illustrated in Fig. 3b were made 
in similar fashion. These four mtDNAs, 
when stably maintained in the B. subtilis 
genome, should be converted to a circular 
DNA form for different aims in mouse 
mitochondrial genetics [35, 36]. Conver- 
sion of the genome-integrated part to the 
circular plasmid form was performed ef- 
ficiently by the Bacillus Recombinational 
Transfer (BReT) method, which is one 
of the technologies used in BGM (as 
shown in Fig. 10). Finally, the four circu- 
lar mtDNA species were transferred into E. 
coli, taking advantage of the shuttling na- 
ture of the BReT plasmid, and this resulted 
in no apparent growth reduction, which 
was consistent with previous reports [12, 
34]. Cloning of the mouse mitochondrial 
genome was reported by other groups us- 
ing E. coli, together with the latest reports 
using PCR [34, 35] and yeast, a eukary- 
ote host which is also suited to Giant 
DNA cloning [13]. To summarize, mtDNA 
molecules are currently deliberately syn- 
thesized with minimal difficulty, though 
technical advances will be required before 
the designed molecules can be utilized to 
replace cellular mtDNA [35-37]. 
Fig. 3 (a) Domino sets to complete 

mtDNA cloning. Four amplified segments 
used as domino from mouse mtDNA are 
shown. These are independently cloned in 
pBR322-based E. coli plasmids specialized for 
the domino method [1]. Some length variation 
stems from technical requirements, depending 
on the first domino to start domino elonga- 
tion [1]. The structure and gene content of 
mtDNA from a mouse BALB/c is shown at 
the bottom left. Nucleotide positions of de- 
signed domino fragments are listed in Ref. [1]; 
(b) Complete mouse mtDNA by a domino 
method. Four different mtDNAs in the B. 
subtilis genome are shown with intermediate 


constructs integrated in the B. subtilis genome 
[1]. Transfer of complete mtDNAs to the 
plasmid form was conducted by a retriev- 

ing method (as described in Sect. 4.2.3 and 
Fig. 10). The plasmid insert locus is different 
according to the first domino adopted. Associ- 
ated antibiotic selection markers are illustrated 
by small open and closed circles. The gray 
circle at the bottom shows retrieved mtDNA, 
while the pink and green arrows in the outer 
circle represent pBR322 sequence inserted into 
mtDNA with the original BGM strain’s name. 
The pBR sequences both serve in the retrieval 
protocol (BReT, shown in Fig. 10) and allow 
replication in E. coli [1]. 
4 
Production of Heavily Manipulated DNA 
for Mouse Genetics 


4.1 
BAC Plasmid Vectors for Large DNA 
Handling in E. coli 


The size of the domino fragments is a 
key factor for a rapid completion with 
fewer domino elongations. An increase 
in the domino size should automatically 
decrease the total number of dominos. 
The BAC vector, introduced as early as 
1992, is capable of cloning DNA as 
large as 100-200 kbp [40]. Despite the implications 
of the nomenclature, it is not 
an artificial chromosome but rather an 
F plasmid-based cloning vector 
for E. coli. During the 1990s, BAC li- 
braries similar to those shown in Fig. 4 
were created from mouse or human 
genomes that had been designed to pro- 
vide DNA for contributing to long-range 



Fig. 4 BAC library for the genome DNA. The 
ability of the BAC vector to carry large DNA 
(>100kb) is useful not only for the direct 
analysis of genomic regions in reverse ge- 
netics, but also for the de novo synthesis of 


sequence determinations [41, 42]. Al- 
though recent sequence determinations 
by next-generation sequencers have dra- 
matically reduced the need for BAC li- 
braries as resources, it must be noted that 
BAC clones can still be used successfully 
in comprehensive mouse genetics. Thus, 
mouse geneticists should have appropriate 
protocols available for manipulating BAC 
inserts. Such manipulations would include 
subcloning into another 
plasmid, the insertion or deletion of re- 
porter genes, the addition of drug-selection 
or visualization markers, and the intro- 
duction of mutations in introns or coding 
regions (as summarized in Fig. 5). Com- 
mercially available E. coli BAC kits may 
be helpful in the engineering of BACs; 
however, as summarized in Fig. 6, E. 
coli BAC kits seem to have greater re- 
strictions and limitations for the precise, 
stable and effective production of mutag- 
enized BACs in comparison to the BGM 
system. 


genomic DNA in the field of synthetic biology. 
The open and closed arrows show BAC vec- 
tor halves, as already indicated in Fig. 2a. A 
BAC library carrying genomic large inserts is 
illustrated at the right. 
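
The remark in Sect. 4.1 that larger dominos mean fewer elongation steps can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that all dominos have the same size and the same overlap with their neighbor; under that assumption n dominos span d + (n - 1)(d - o) base pairs, so the number needed for a target of length G is the ceiling of (G - o)/(d - o). The function name and the example sizes are hypothetical.

```python
import math


def dominos_needed(target_bp: int, domino_bp: int, overlap_bp: int) -> int:
    """Number of equal-sized, equally overlapping dominos needed to span the target."""
    if not 0 <= overlap_bp < domino_bp:
        raise ValueError("overlap must be non-negative and smaller than the domino size")
    if target_bp <= domino_bp:
        return 1
    # n dominos span domino_bp + (n - 1) * (domino_bp - overlap_bp) base pairs.
    return math.ceil((target_bp - overlap_bp) / (domino_bp - overlap_bp))


# A hypothetical 1000-kbp target covered by plasmid-sized or BAC-sized dominos.
print(dominos_needed(1_000_000, 5_000, 1_000))     # 250
print(dominos_needed(1_000_000, 150_000, 30_000))  # 9
```

With these assumed numbers, moving from 5-kb to 150-kb dominos reduces the number of elongation steps from 250 to 9.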
Fig. 5 BGM pipelines produce multiple and 
comprehensively engineered BAC clones. The 
case of a representative BAC is shown [5]. A 
mouse genomic DNA fragment (red line) is 


4.2 
BAC Domino Transfer in BGM 


Before the BGM system can be used, the 
BACs must first be transferred into the 
system. The BAC flow from E. coli to 
the BGM vector is shown in Fig. 7a. In- 
tegration required only two homologous 
sequences and appropriate selection mark- 
ers for the bacterium. The BAC vector 
sequence was installed at the integration 
locus, under a molecular principle simi- 
lar to that of the previous pBR322-based 
integration [26]. One major difficulty with 
this installation lies in the selection of cor- 
rect BGM integrants, because commercial 
or laboratory-prepared BAC clones carry 
no appropriate selection markers for B. 
subtilis. This drawback was resolved by 


[Fig. 5 panel labels: Insertion, Deletion, Transposition, Inversion, Replacement] 

cloned into the BGM vector via a BAC clone 
(e.g., BAC1) carrying exons and introns. Large 
DNA can be manipulated flexibly and rapidly 
in the BGM system. 


the pre-installation of a counterselection 
system into B. subtilis to stimulate the in- 
tegration process, as shown in Fig. 7b. 
Several mouse BACs that were trans- 
ferred using the genetic devices without 
causing any structural disorder were ex- 
amined for their potential use in com- 
plex applications, as shown in Fig. 5 
[4, 26, 30, 31, 43]. These applications 
consisted of the formation of variants 
in size and orientation (as described in 
Sect. 4.2.1), the connection of two over- 
lapping BACs (Sect. 4.2.2), application to 
the generation of transgenic mice (Sect. 
4.3), and the unique preservation of de- 
signed BACs in BGM for prolonged stor- 
age in the absence of special facilities 
(Sect. 4.5). 
Fig. 6 Engineering of BAC inserts. Multiple 
manipulations of large BAC-DNA fragments are 
possible in the B. subtilis (BGM) system 
via recA-dependent homologous 
recombination. The system allows for 
multiple operations, conducted for an unlim- 
ited number of times, in sequential manner. 
In the E. coli system only single manipulations 
can be performed utilizing RecE/T, Cre/LoxP, 


4.2.1 Complex Engineering for BAC 
Inserts (Deletion, Inversion) 

Because the orientation of BAC inserts 
during library preparation is basically 
random, molecular tools to invert the 
insert transferred to BGM have been 
developed. The application of modified 
tools to invert the large B. subtilis genome 
regions [44, 45] reverses the orientation of 
BAC inserts in BGM [43]. With regards 
to systematic deletion, two small DNA 
segments with deletion endpoints are 
connected so as to flank the appropriate 
antibiotic resistance markers for B. subtilis, 
as indicated in Fig. 8a,b. The region to be 
deleted was replaced by a marker gene via 
two homologous recombinations at the 
two flanking segments. Figure 8b shows 
an example of nested deletions by the BGM 


or FLP/FRT (site-directed recombination tech- 
nology). The stability of DNA attributable to 
the one-copy state of the B. subtilis genome 
assures the maneuverability of recA-dependent 
homologous recombination. The open and 
closed arrows are BAC vectors. The red or 
blue lines show independent inserts cloned 

in BACs. 


protocol that were located in the mouse 
jumonji (jmj) gene locus, and ranged from 
11 to 86 kbp. 
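
The deletion principle described in this section, in which the region between two flanking segments is replaced by a marker through two homologous recombinations, can be mimicked on a string. This is a schematic sketch under simplifying assumptions: sequences are plain text, each flanking segment occurs once, and the marker is written as a bracketed label. All names and sequences are hypothetical.

```python
def delete_by_double_crossover(insert: str, left_arm: str, right_arm: str, marker: str) -> str:
    """Replace everything between the two flanking segments with the marker."""
    left_start = insert.find(left_arm)
    if left_start == -1:
        raise ValueError("left flanking segment not found")
    left_end = left_start + len(left_arm)
    # The right segment must lie downstream of the left one.
    right_start = insert.find(right_arm, left_end)
    if right_start == -1:
        raise ValueError("right flanking segment not found downstream of the left one")
    return insert[:left_end] + marker + insert[right_start:]


# Upper-case blocks are the flanking segments; lower-case is other insert sequence.
cloned = "ccgg" + "ATGCAT" + "ttttttttttgg" + "GGATCC" + "aacc"
edited = delete_by_double_crossover(cloned, "ATGCAT", "GGATCC", "[cat]")
print(edited)  # ccggATGCAT[cat]GGATCCaacc
```

Keeping one flanking segment fixed and moving the other further along the insert gives a series of nested deletions in this model.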


4.2.2 Connection of Two Adjacent BACs 
in BGM 
The domino inserts of BAC should func- 
tion in a manner similar to that described 
for pBR322. However, the selection mark- 
ers for B. subtilis again raise a problem 
with respect to the use of BAC-dominos. 
Because BAC vectors normally do not pos- 
sess ab initio selection markers effective 
for B. subtilis, the required markers were 
inserted after BAC integration in BGM [2, 
46]. 

As shown in Fig. 9, connection of the 
two overlapping BACs in the BGM system 
was first demonstrated for the two BACs 

Fig. 7 (a) BAC in BGM. The integration of BAC inserts (red) into the B. subtilis 
genome (green oval) is a starting point for subsequent manipulations. The BAC 
vector region (BAC*; open and closed arrows) preinstalled in the host genome 
provides the cloning site for guest BAC clones via homologous recombination 
(identified by the crossed lines); (b) Selection of markerless BAC integration in 
the B. subtilis genome. Upper: The present counter selection system is shown. The 
cI repressor gene and the neomycin-resistance gene (neo) (pink) under the Pr 
promoter (pink arrowhead) result in a positive selection of marker-less BACs for 
integration of the BAC clone (e.g., BAC1) [2, 5]; Lower: BAC clones, if made in the 
new BAC vectors, p108BGMC or p108BGME, carrying an antibiotic marker (small 
open circle) for B. subtilis can be cloned directly in BAC-BGM [5]. 
Fig. 8 (a) Size variation: concept underlying 
the deletion formation. Two small DNA segments 
with deletion endpoints are connected 
so as to flank appropriate antibiotic resistance 
markers (small open and closed circles) for 
B. subtilis. The region to be deleted is 
replaced by a marker gene via two homologous 
recombinations at the two flanking 
segments. If half of the pBR322 sequence 
(blue and yellow stripe) or the BAC sequence 
used as LPS in Fig. 2b is chosen as one segment, 
then a nested deletion is performed, as 
shown in Fig. 8b. The green curved line shows 
the B. subtilis genome. The marker genes cat and 
erm confer resistance to chloramphenicol and 
erythromycin, respectively; (b) Transformation using the 
optional small DNA fragments and antibiotic 
resistance markers produces the designed 
deletion. Systematic deletion formation from 
a mouse genomic region including the jmj 
gene (110 kb) was performed. The two I-PpoI 
sites described in Sect. 4.2.3 are preinstalled 
to generate insert fragments. Shortened DNA 
fragments are visible, and their sizes were confirmed 
by I-PpoI fragments resolved by gel 
electrophoresis in the image at right (open 
arrowheads with the size of the deletion are 
indicated at top right). All symbols (open arrow, 
open and closed circles, color boxes, and 
red lines) are as in Fig. 8a. 
including a mouse gene jumonji (jmj) locus [31]. The two adjacent inserts,
of 196 and 220 kbp, shared approximately 60 kbp of DNA, and each was
expected to carry a different antibiotic resistance marker at the end of its
insert. After insertion of the appropriate antibiotic resistance marker
genes in the two BGM clones, the connection of the two overlapping dominos
produced a 355 kb-long DNA, as illustrated in Fig. 9 [31]. The connection of
two BACs by an E. coli system alone is considered possible, but is rarely
made. The connection of two overlapping BACs in BGM also successfully
yielded a 240 kb-long DNA covering the mouse class I odorant receptor (OR)
gene locus (as described in Sect. 4.3) [43].
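
The size of a connected product follows from simple arithmetic on the two
insert sizes and their overlap, because the shared region is counted only
once. The following is a minimal sketch of that calculation; the function
name joined_length_kb is hypothetical, and the overlap is the approximate
value quoted in the text, so the estimate agrees with the reported 355 kb
only to within about 1 kb.

```python
def joined_length_kb(insert_a_kb, insert_b_kb, overlap_kb):
    """Length of two overlapping inserts after connection.

    The shared region is present in both inserts but only once in the
    joined product, so it is subtracted from the sum.
    """
    if overlap_kb < 0 or overlap_kb > min(insert_a_kb, insert_b_kb):
        raise ValueError("overlap must lie between 0 and the shorter insert")
    return insert_a_kb + insert_b_kb - overlap_kb


# Values quoted in the text for the two jmj BAC inserts (overlap approximate).
estimate = joined_length_kb(196, 220, 60)
print(f"estimated joined length: {estimate} kb")
```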

4.2.3 Isolation and Purification of BAC in 
BGM 

The engineered BAC DNA that remains in 
the B. subtilis genome must be collected in 
a test tube in order to produce the trans- 
genes. Three methods were proposed, as 
summarized in Fig. 10. 

The simplest and most straightforward
method is digestion by specialized
endonucleases, followed by
isolation of the engineered BAC from


Fig. 9 Connection of two BACs with an overlap in BGM. Two BAC clones (red
and blue), pKANEG (196 kb) and pKANEH (220 kb), cover the mouse genomic jmj
region, sharing a 60 kb overlapping sequence. Each BAC was separately
integrated into BGM. Transformation of the BGM carrying the 220 kb BAC2-DNA
by using purified genomic DNA from another BGM carrying the 196 kb BAC1-DNA
leads to homologous recombination between the 60 kb overlapping region and
the sequence shared with the B. subtilis genome portion (striped boxes).
Connection of the two BACs yielded 355 kbp DNA, including the entire jmj
gene. The BAC vector region is shown by open and closed arrows. BAC
insertion after I-PpoI digestion was confirmed by agarose gel
electrophoresis (right panel; CHEF gel with lanes of 196, 355, and
220 kbp). The I-PpoI recognition sequence shown in Sect. 4.2.3 is indicated
by I.
Fig. 10 Three methods to retrieve BAC-DNA from BGM: (1) fragmentation by
endonuclease, (2) genome dissection, and (3) recombinational transfer.
(1) Fragmentation using endonucleases is simple. Improvements for the
isolation step led to transgenic mice [43]. (2) Intrachromosomal homologous
recombination between the two DNA repeats (striped boxes) makes it possible
to physically disconnect the BAC segments. If the disconnected DNA is
designed to carry a DNA replication origin site (oriN; striped circle)
different from oriC of the chromosome, it starts replicating autonomously as
a plasmid independent from the chromosome [47]. The potential size to be
disconnected is estimated to be on the order of several hundred kilobase
pairs (M. Itaya, S. Kaneko, unpublished data). (3) For recombinational
transfer, see the text and Refs [27, 48-50]. Although the process appears to
be complicated, the BAC insert (red line) in the BGM can be copied and
pasted into a linear plasmid to complete circularization. This process
yields a plasmid carrying the copied DNA segment. The open and closed arrows
are BAC, or pBR322 in some cases. Blue boxes crossing the B. subtilis genome
show the I-PpoI recognition sequence. The BAC or pBR vector (violet) carries
the part which is derived from plasmid DNA and is capable of replication in
B. subtilis.

the electrophoretic gel [4, 26, 30, 31, 44]. Two endonucleases have been
designed for this purpose and are regularly used in the authors’ BGM system:
I-PpoI for the 23-base sequence ATGACTCTCTTAA/GGTAGCCAAA, and I-SceI for the
18-base sequence TAGGGATAA/CAGGGTAAT. In fact, two I-PpoI recognition
sequences are already created in the B. subtilis genome at the two ends of
the BAC integration locus. The linear target DNA produced by I-PpoI
digestion is clearly resolved from the BGM vector by pulsed-field gel
electrophoresis, and readily isolated from agarose gels, as shown in
Fig. 11. The isolation of giant DNA longer than several hundreds of kilobase
pairs is in general very difficult because of its fragility in test tubes
and the possible presence of contaminating nucleases. A clear example is
detailed in Sect. 4.3.
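
The fragmentation step can be illustrated by locating the two recognition
sequences in a DNA string and listing the fragment sizes expected after
digestion. The following is a minimal sketch under stated assumptions: the
helper names find_cut_positions and linear_fragment_lengths are
hypothetical, only the strand as written is searched, the "/" in each
sequence is simply taken as the cut position, and the toy sequence is
invented for illustration rather than taken from any real BGM construct.

```python
# Recognition sequences as written in the text; "/" marks the cut position.
RECOGNITION_SITES = {
    "I-PpoI": "ATGACTCTCTTAA/GGTAGCCAAA",
    "I-SceI": "TAGGGATAA/CAGGGTAAT",
}


def find_cut_positions(sequence, site):
    """Return 0-based cut positions for every match on the strand as written."""
    motif = site.replace("/", "")
    offset = site.index("/")  # number of bases that precede the cut
    seq = sequence.upper()
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start + offset)
        start = seq.find(motif, start + 1)
    return positions


def linear_fragment_lengths(sequence, cuts):
    """Lengths of the pieces obtained by cutting a linear sequence at `cuts`."""
    bounds = [0] + sorted(cuts) + [len(sequence)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]


# Toy example (invented): a short insert flanked by two I-PpoI sites,
# mimicking the two sites at the ends of the BAC integration locus.
ppoi = RECOGNITION_SITES["I-PpoI"].replace("/", "")
toy = "A" * 5 + ppoi + "C" * 10 + ppoi + "G" * 5
cuts = find_cut_positions(toy, RECOGNITION_SITES["I-PpoI"])
print(cuts, linear_fragment_lengths(toy, cuts))
```

The middle fragment corresponds to the released insert. A real analysis
would also need to search the complementary strand and treat the B. subtilis
chromosome as circular, both of which this sketch omits.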
The second method, genome dissection, 
largely depends on B. subtilis genetic 
systems. Despite the high potential to 
produce circular DNA larger than 300 kbp 
by this method [5, 47], its application has 
been restricted by the rather complicated 
procedures involved in its creation (M. 
Itaya, unpublished observations). 

The third method uses a yet more com- 
plicated genetic process referred to as 
BReT [1, 5, 12, 26, 27]. Unlike with the pre- 
vious disconnection method, in BReT the 
engineered inserts are copied and pasted to 
the existing plasmid, as shown in Fig. 10. 
The BReT protocol has yielded recombi- 
nant genomes in a completely circular 
form for lambda [27], mtDNA [1, 12], and 
several BACs [26] and other regions of 
B. subtilis [48, 49]. It should be mentioned 
that the BReT method produces duplicated 
regions — one in the plasmid and the other 
remaining in the B. subtilis genome — as 
seen in Fig. 10. The BReT process abso- 
lutely requires recA functions to conduct 
homologous recombination, but the RecA 
may have adverse effects on the struc- 
tural maintenance after completion. As 
reported previously [50], the transfer of 
plasmids to a recA mutant background 
may be needed to minimize unexpected 
alterations in structure. 


4.3 
Applications for Heavily Engineered BACs 
Produced in Transgenic Mice 


It is noteworthy that the first production of transgenic mice using the
authors’ BGM pipeline was reported only recently [43]. Mus musculus carries
the fish-like OR (class I OR) gene family, which is located on chromosome 7
and consists of 158 genes that form a single huge gene cluster. It has been
reported that a cis-acting locus control region is expected to regulate
transcription, although such a region has not been found in the class I OR
gene family. Initially, all of the transgenes were constructed via the BGM
system in order to locate functional regions. As shown in Fig. 11, on this
occasion an engineered fragment (250 kb) was reconstructed by BGM and used
for mouse transgenesis [43]. Figure 11 highlights two neighboring BAC
clones, of 115 and 220 kbp, that may cover the plausible region; the
manipulations applied to them are described above. These comprised insertion
of the reporter gene encoding enhanced green fluorescent protein (EGFP) into
the target site region without introducing an unwanted selectable marker,
inversion of a BAC insert for suitable orientation (Sect. 4.2.1), connection
of two contiguous BAC clones (Sect. 4.2.2), and purification by special gel
electrophoresis (Sect. 4.2.3). The engineered DNA fragments isolated by
I-PpoI digestion were microinjected into fertilized mouse eggs to generate
transgenic mice. Pure and undamaged DNA is required for this microinjection.
The steps needed to collect undamaged giant DNA in liquids were examined,
and an improved protocol using two gel electrophoresis runs, followed by
collection on a dialysis membrane, made it possible to concentrate undamaged
giant DNA. The resulting purified 250 kb mouse genomic DNA carrying reporter
genes was then used in the production of transgenic mice [43].

It should be emphasized here that the 
nucleotide sequence of the 250 kbp frag- 
ment used for microinjection acquired no 
mutations [43]. This was consistent with 
previous findings that the BGM operation 
did not result in the acquisition of any mu- 
tations (unpublished observations). This 
Fig. 11 The first transgenic mouse using BAC engineered through the BGM system. The class I odorant
receptor (OR) gene, indicated by small closed boxes located in chromosome 7 of a mouse, was mutagenized.
The first example, which relied on the BGM system alone, is detailed in Sect. 4.3 (see also Ref. [43]).
Interestingly, transgenic mice carrying the enlarged transgene recapitulated the expression and axonal
projection patterns of the target class I OR gene in the main olfactory system. The most striking result with
respect to BGM manipulations is that the nucleotide sequence of the engineered 250 kbp fragment used for
microinjection acquired no mutations through inversion induction. Insertion of the EGFP gene (green
arrowheads) and connection of separately made BACs are indicated [43]. P:n is the Pr-neo shown in Fig. 7b.
Tg and inv represent transgene and inversion, respectively.
inherent nature of B. subtilis strongly 
supports the reliability of the BGM system. 
Experiments to produce other transgenic 
mice using the BGM system are ongoing 
(unpublished data). 

The first-ever use of engineered BACs for mutagenesis studies on a model
animal was reported in 1997 [51]. Mammalian genes include factors such as
promoters, 5’-untranslated regulatory sequences, introns, exons, splicing
and alternative splicing sites, 3’ long terminal repeats, and polyA addition
sites. Functional accessory sequences, which encode small RNAs and noncoding
RNAs, are regulated by, for example, methylation, histone binding, and
nucleosomes.
The study of these complex genes in- 
evitably requires the use of giant DNA 
ranging from dozens to hundreds of 
kilobase pairs. For such studies, engi- 
neered BACs in BGM have shed light on 
the future performance of more compli- 
cated and widely ranging target loci on 
chromosomes. 


4.4 
Comprehensive BAC/BGM Library 
Construction Proposals 


Currently, the direct preparation of BAC 
libraries for mouse genetic studies oc- 
curs fairly infrequently, and many re- 
search groups employ commercially avail- 
able, E. coli-based BACs and manipulate 
these themselves by using engineering kits. Alternatively, commercial
services for breeding transgenic animals are available, in which mouse BACs
are used for complicated genome engineering. Thanks to the comprehensive BAC
libraries for model animals, most genome regions are covered by BACs that
are available to the public. The production and application of
heavily engineered BACs through a BGM 


system, as highlighted in the previous 
sections, indicates that the use of BGM 
from start to finish is fully recommended. 
The first demonstration of a full appli- 
cation to breeding a desired transgenic 
mouse should shed light on the BGM 
pipeline as a novel alternative approach 
that should attract a wide range of interest 
for both research and business applications. Consequently, any sequence
design desired for producing a transgenic mouse should be truly achievable,
and it is therefore believed that BAC/BGM will offer a great opportunity for
next-generation genetics, if allowed. It should also be recognized that even
the BAC transfer step
to the B. subtilis genome (as described 
in Sect. 4.2 and Fig. 7a,b) must be a 
major bottleneck for nonspecialists and 
their ordinary research environments. It 
seems evident that starting from a BAC already in the B. subtilis genome
should greatly “lower the bar” from a technical standpoint. A proposal to
convert
the present BAC libraries to BGM li- 
braries, known as BAC/BGM libraries, is 
being planned, and steps to deliver BAC 
DNA from E. coli to BGM could become 
simpler and less laborious, as reported 
elsewhere [52-55]. The background the- 
ory and practical data in these reports 
are specific to E. coli and/or B. subtilis 
specialists; hence, to shorten this long 
and complicated story, the construction 
of a BAC/BGM library would provide 
an invaluable resource for comprehensive 
genomics and reverse genetics for the fore- 
seeable future. 


4.5 
Long-Term Storage of Valuable DNA 
Resources in the BGM System 


Escherichia coli transformant cells should be preserved at −80 °C in the
presence of glycerol or dimethyl sulfoxide (DMSO), or at room temperature
after being lyophilized. Unfortunately, these preservation techniques
require dedicated facilities and space, and the cost of frozen stocks is
inevitably high because the deep-freeze apparatus consumes energy constantly
and must always be kept ready for emergencies, such as the electricity
blackouts caused by earthquakes and tsunamis during recent years. Small
DNAs, similar to those that serve as dominos in the pBR322, are resynthesized
by PCR technology on a daily basis and need not be preserved over the long
term. However, engineered BACs are invaluable for future use and should be
correctly stored because they are not readily reproduced.

In sharp contrast to E. coli, B. sub- 
tilis is capable of forming spores that 
survive for long periods in many haz- 
ardous environments, including drying 
out on a bench-top. Spores begin to 
germinate instantly when exposed to a 
nutrition-rich environment, and will form 
colonies the next day on an appropri- 
ate plate [56-58]. Spores of B. subtilis, 
a unique host for the BGM system, are 
thus expected to be ideal long-term and 
cost-free reservoirs of DNA, requiring no 
special equipment or facilities. Certain aspects of the stability of large
DNAs integrated in the B. subtilis genome for months inside spores were
examined initially [26], and examinations of longer-term preservation have
since commenced (M. Itaya, S. Kaneko, unpublished observations). BGM
libraries, if constructed correctly (as indicated in Sect. 4.4), are
automatically converted to spores and may serve as long-term reservoirs for
BACs.


5 
Future Perspectives for the BGM System in 
Molecular Cloning and Genome Design 


Molecular cloning is a fundamental technology, and the two types of inherent
mouse DNA (mtDNA and chromosomal genomes) require different forms and levels
of technology for their genetic research. Furthermore, owing to the recent
“burst” of genome sequence informa- 
tion, affordable genes are not limited to 
the existing genomes but may be ex- 
tended to those designed de novo. Indeed, 
current life science studies are moving 
rapidly toward more relevant synthetic bi- 
ology approaches, featuring the use of 
de novo-designed DNA pieces [7, 8, 22, 
28]. In particular, the use of DNA larger 
than the DNAs that have been easily han- 
dled in the past, as well as the use of 
synthetic DNA, has become increasingly 
prevalent. In all such cases a typical re- 
search cycle consisting of the design of 
genes/genomes, the actual construction, 
and in vivo testing and redesign has stim- 
ulated the pace of research, in addition 
to the traditional one-way genetics based 
on natural spontaneous mutations. Most 
research groups will have few complaints 
in this respect, as they may be largely 
satisfied with the familiar E. coli systems 
and unconcerned about the need for more 
complex DNA. However, in the present 
and in the short-term future, most in- 
vestigators will need to deal with DNA 
that has been designed de novo and syn- 
thesized in an innovative manner. In 
this respect, the BGM system has the 
unique advantage of being able to produce 
“any” desired DNA molecule, whereas 
the E. coli system would suffer certain 
limitations. 

In addition to providing an introduc- 
tion to the subject, this chapter has 
focused on three topics with solid achievements, namely mouse mtDNA
synthesis,
the application of chromosomal DNA by 
BAC to transgenic mouse production, and 
the long-term, less-expensive and low-risk 
preservation of valuable DNA molecules. 
It is hoped that the authors’ treatment of 
these topics will help readers to under- 
stand that the present and future produc- 
tion of DNA depends on the use of safe and 
sound BGM systems, in which B. subtilis 
168 plays an essential and central role. Yet, 
the utility of E. coli, without which inter- 
mediate DNA and BGM-based DNA could 
hardly be handled, must still be greatly 
appreciated. 

The above-described BGM system pro- 
vides a novel and powerful pipeline for 
DNA synthesis. Given any genes — re- 
gardless of whether they exist in Na- 
ture or have been designed — the system 
would be capable of synthesizing en- 
tire gene clusters that are testable for 
cells. 

Although the first mouse BAC trans- 
fer to the BGM system was reported 
only 10 years ago [25], the first applica- 
tion of the system from start to finish 
on transgenic mouse production has al- 
ready been achieved [43]. It has been 
shown clearly — as suggested earlier — that 
once the BAC has been transferred to 
the BGM host, then purpose-oriented 
modified DNAs could be easily prepared 
and produced via the BGM pipeline (as 
shown in Fig. 5). Today, DNA synthe- 
sis is continually and dramatically chang- 
ing the way in which advancing genetics 
and genomics are regarded in all of the 
life sciences. Moreover, the BGM sys- 
tem is expected to prove extremely use- 
ful, particularly for the creation of giant 
DNA. 
Keywords 


Unnatural base pair 

An unnatural base pair is an artificial extra base pair that functions as a third base 
pair, along with the natural A-T(U) and G-C pairs, in replication, transcription, and/or 
translation. 


Genetic alphabet expansion 
Genetic information is written in an alphabet of four natural bases, and thus it could be 
expanded by introducing unnatural bases. 


PCR 
The PCR (polymerase chain reaction) is a DNA amplification method using thermostable 
DNA polymerases and primers, developed by Kary Mullis. 


T7 transcription 
T7 transcription is a system using T7 RNA polymerase to prepare RNA molecules. 


Real-time qPCR 
Real-time quantitative PCR is a method for detecting and quantifying target DNA 
molecules by monitoring PCR amplification; it is useful for disease diagnosis. 


SELEX (in-vitro selection) 

SELEX (systematic evolution of ligands by exponential enrichment) or in-vitro selection is 
an evolutionary engineering method for generating nucleic acid aptamers and ribozymes. 
Functional nucleic acids can be isolated from a library of random sequences by repeated 
selection and amplification processes. 


Expansion of the genetic alphabet by an artificial extra base pair (unnatural base 
pair) could provide a new biotechnology in the area of synthetic biology, enabling the 
site-specific incorporation of functional components of interest into nucleic acids and 
proteins. To store and propagate genetic information, unnatural base pairs require 
highly exclusive selectivity as a third base pair in polymerase reactions. Recently, 
several unnatural base pairs that function in replication, transcription and/or 
translation have been developed. For replication, unnatural base pairs require 
highly selective complementarity in their base-pair formation. For transcription, 
unnatural base pairs with the unidirectional function to incorporate an unnatural 
base substrate into RNA opposite its pairing partner in templates can be used. For 
translation, the thermal stability and selectivity of the unnatural base pair formation 
in the codon-anticodon interaction are important. According to these categories, 
various unnatural base pairs are described, along with their applications. 
1 
Introduction 


Expansion of the genetic alphabet is a new 
biotechnology achieved by the creation of 
an artificial extra base pair (unnatural base 
pair) that functions, along with the natu- 
ral A-T(U) and G-C pairs, in replication, 
transcription, and translation (Fig. 1). Al- 
though genetic alphabet expansion is a 
new area of synthetic biology, the idea 
dates back more than 50 years. In 1962, 
Alexander Rich commented that “In general, the trend in macroscopic organic
evolution is toward increased complexity,” and he proposed an artificial
third
base pair between isoguanine (isoG) and 
isocytosine (isoC) (Fig. 2) [1]. If the unnat- 
ural base pair functioned as a third base 
pair in the central dogma, then it could 
provide a higher evolution and increased 
functionality of nucleic acids and proteins, 
beyond the limitations of the four-letter 
genetic alphabet. Unnatural bases with 
different chemical and physical proper- 
ties or linkages with functional groups 
could be site-specifically incorporated into 
DNA and RNA by polymerase reactions, 
via the formation of the third base pair be- 
tween the unnatural base substrates and 
their pairing partners in templates. In addition,
new codon-anticodon interactions
generated with unnatural base pairs could 
enable the site-specific incorporation of 
nonstandard amino acids into proteins by 
translation. 
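
The gain in coding capacity can be made concrete by counting codons. The
following is a minimal sketch, assuming triplet codons and one added
unnatural pair written as X and Y (placeholder letters, as in the X-Y pair
of Fig. 1); the function name codons is hypothetical. It only enumerates
possible triplets and says nothing about which of them could actually be
decoded.

```python
from itertools import product

NATURAL = ["A", "C", "G", "U"]
EXPANDED = NATURAL + ["X", "Y"]  # X and Y are placeholder unnatural bases


def codons(alphabet, length=3):
    """All possible codons of the given length over the alphabet."""
    return ["".join(letters) for letters in product(alphabet, repeat=length)]


natural_codons = codons(NATURAL)
expanded_codons = codons(EXPANDED)
# Codons that contain at least one unnatural base are new coding space.
new_codons = [c for c in expanded_codons if "X" in c or "Y" in c]

print(len(natural_codons), len(expanded_codons), len(new_codons))
```

With four letters there are 4^3 = 64 triplets; with six letters there are
6^3 = 216, of which 152 contain at least one unnatural base.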

From the late 1980s to the 1990s, 
Benner’s group reported a series of 
unnatural base pairs with different 
hydrogen-bonding patterns from those 
of the A-T(U) and G-C pairs, includ- 
ing the isoG-isoC pair [2-4]. They 
examined in-vitro polymerase reactions 
involving these unnatural base pairs, 
and demonstrated their potential in an 


artificially expanded genetic information 
system (AEGIS). This pioneering research 
resulted in the successful site-specific 
incorporation of a nonstandard amino 
acid, 3-iodotyrosine, into a 16-amino 
acid peptide by an in vitro Escherichia 
coli translation system [5]. The research 
also revealed some problems with the 
nonstandard, hydrogen-bonded unnatural 
base pairs. For example, because of the 
tautomerism of the isoG base, the enol 
form of isoG pairs with T, substantially 
reducing the isoG—isoC pairing selectivity 
in replication (Fig. 2) [6]. In addition, 
the isoC nucleoside is chemically 
unstable and decomposes under acidic 
or basic conditions [7]. Furthermore, 
the 2-amino group of isoC prevents 
interactions with some polymerases, 
unlike the 2-keto group of the natural 
pyrimidines [8]. 
Meanwhile, in 1988 Rappaport reported another nonstandard, hydrogen-bonded
unnatural base pair between 6-thioguanine (Gs) and
5-methyl-2-pyrimidinone (TH) (Fig. 2), by combining the hydrogen-bonding
patterns with the idea of steric exclusion [9]. The 6-thio group of Gs
excludes the pairing with C by the steric repulsion between the 6-thio group
and the 4-amino group of C, and thus Gs pairs with TH predominantly.
Although the Gs-TH pair selectivity was not high, this novel idea was
significantly expanded as a new concept by Kool’s group during the late
1990s. Kool and colleagues synthesized a new set of nonhydrogen-bonded bases
of 4-methylbenzimidazole (Z) and difluorotoluene (F), as isosteric mimics of
A and T, respectively (Fig. 2) [10, 11]. Replication experiments using these
base analogs suggested that the shape complementarity of the pairing bases
is important, while the hydrogen bonds in



Synthetic Genetic Polymers Functioning to Store and Propagate Information 


Fig. 1 Expansion of the genetic information by an unnatural base pair (X-Y).
Panels: replication (natural base dNTPs plus dXTP and dYTP), transcription
(natural base NTPs plus YTP or modified YTP), and translation (tRNA,
protein).


a base pair are not required [12]. Thus, 
nonhydrogen-bonded base pairs, namely 
hydrophobic base pairs, were pursued for 
the third base pairs. 

To date, many unnatural base pairs have 
been designed and tested in polymerase 
reactions, and several have been devel- 
oped for polymerase chain reaction (PCR) 
amplification [13-22]. Here, the two types 
of unnatural base pairs that function in 
replication, transcription and/or transla- 
tion will be introduced: some function 
complementarily as both substrates and 
template bases of each pairing base with its 
exclusive selectivity, while others function 
unidirectionally as a substrate of one of 
the base pairs opposite its pairing partner 
in templates. Other unnatural base pairs that do not yet function in
polymerase reactions, and modified nucleotides with natural base analogs or
nonstandard sugars and phosphates that are used in polymerase reactions,
will not be discussed here.




2 
Unnatural Base Pairs 


2.1 
Bidirectionally Complementary Base Pairs 


2.1.1 Ds—Pa Pair 


A hydrophobic base pair between 7-(2-thienyl)imidazo[4,5-b]pyridine (Ds)
and pyrrole-2-carboaldehyde (Pa) was designed by fine-tuning the shape
complementarity between pairing bases without hydrogen-bonding interactions
(Fig. 3) [23]. The shapes of these
hydrophobic bases differ from those of the 
natural bases, but they fit well together. 
The Ds base comprises a purine analog, 
imidazopyridine, and a thienyl group, 
and thus it is bulkier than the natural 
purine bases, A and G. The sterically 
large thienyl group of Ds efficiently 
prevents the mispairing with the natural 
bases. In contrast, the Pa base has a 
five-membered ring, which is smaller 
than the pyrimidine rings of C and T. 
Fig. 2 Structures of natural and early unnatural base pairs. 


Like the keto groups of C and T, the 
aldehyde group of Pa is important for 
interactions with polymerases. Nuclear 
magnetic resonance (NMR) studies using 
its analog base pair in a duplex DNA 
revealed that the unnatural bases fit well, 
with edge-to-edge packing interactions 
like those of the natural base pairs. 

The Ds—Pa pair effectively works in T7 
transcription, and the substrates of Ds 
and Pa are both complementarily incor- 
porated into RNA by T7 transcription, 
with more than 94% selectivity when us- 
ing the same molar equivalents of the 
natural and unnatural substrates [23]. In 
addition, modified Ds and Pa bases can 
be incorporated into RNA by T7 tran- 
scription. An aminopropynyl group is 
attached to position 4 of the Pa base 
via 4-iodopyrrole-2-carboaldehyde, and the 
amino group is used for further mod- 
ification with biotin, fluorophore, and 


ethynyl groups [23-25]. A modified Ds 
base bearing an additional thiophene (Dss) 
is strongly fluorescent with 456 nm emis- 
sion, upon excitation at 385nm (Fig. 3) 
[26]. These modified Ds and Pa substrates 
are also specifically and efficiently incor- 
porated into RNA by T7 transcription, thus 
providing site-specific, functional labeling 
of large RNA molecules. 

The Ds-Pa pair also functions in replication, including PCR amplification.
For high-fidelity PCR, γ-amidotriphosphate modifications (Fig. 3) of Ds and
A are required [23]. Slight amounts of the substrates of Ds (dDsTP) and A
(dATP) are incorporated opposite Ds and Pa, respectively. The use of
γ-amidotriphosphates was found to effectively reduce the incorporation of
mispairs, such as the Ds-Ds and A-Pa pairs, in replication. By using the
substrate mixture of the γ-amidotriphosphates of Ds and A and the usual
triphosphates of Pa, G, C, and T, more than 99% selectivity of the Ds-Pa
pairing was achieved in PCR by exonuclease-proficient Vent DNA polymerase.

Fig. 3 Structures of unnatural base pairs that function in PCR.


2.1.2 Ds—Px Pair 

The Px (2-nitro-4-propynylpyrrole) base is an improved version of the Pa
base (Fig. 3) [27, 28]. To address the A-Pa mispairing, the aldehyde group
of Pa was replaced with the nitro group of Px, so that the oxygen of the
nitro group electrostatically repels the 1-nitrogen of A. In addition, to
increase the hydrophobicity of the Pa base for polymerase interactions, a
propynyl group was added to position 4. The selectivity of the Ds-Px pairing
in PCR was improved to about 99.77-99.92%, depending on the sequence
contexts, by exonuclease-proficient Deep Vent DNA polymerase (Table 1) [28].
DNA fragments containing the Ds-Px pair were amplified 10^7-fold by 30
cycles of PCR. To evaluate the characteristics and the limitations of the
Ds-Px pairing in PCR, the process of a 10-cycle PCR and dilution of the
amplified DNA fragments was repeated up to 10 times. After 100 cycles of PCR
(10 cycles repeated 10 times), more than 97% of the Ds-Px pair survived in
the 10^28-fold amplified products [28]. Although purine-Ds-purine sequences
are slightly less efficient, practical studies of the Ds-Px pair indicated
that any sequence context can be amplified by Deep Vent or AccuPrime Pfx DNA
polymerases.
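
The relationship between amplification, number of doublings, and retention
of an unnatural base pair can be sketched with an idealized model. The
following is a minimal sketch under stated assumptions: every doubling is
assumed to retain the pair independently with the same probability, the
100-cycle amplification is read as 10^28-fold, and the function names are
hypothetical. Experimentally determined selectivities are measured
differently and need not coincide with the value this simple model implies.

```python
import math


def doublings(fold_amplification):
    """Number of ideal doublings needed to reach a given fold amplification."""
    return math.log2(fold_amplification)


def retention(fidelity_per_doubling, n_doublings):
    """Fraction of molecules still carrying the pair after n doublings."""
    return fidelity_per_doubling ** n_doublings


def implied_fidelity(retained_fraction, fold_amplification):
    """Per-doubling fidelity implied by an observed retention and amplification."""
    return retained_fraction ** (1.0 / doublings(fold_amplification))


# Figures from the 100-cycle experiment: more than 97% retention after an
# overall amplification read here as 10^28-fold (an assumption of this sketch).
n = doublings(1e28)
print(f"doublings for 10^28-fold amplification: {n:.1f}")
print(f"implied per-doubling fidelity: {implied_fidelity(0.97, 1e28):.5f}")
```

Because the overall amplification corresponds to fewer doublings than the
number of thermal cycles, the per-doubling figure is the more natural
quantity to compare between experiments of different length.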

Any functional groups, such as biotin 
and fluorophores, can be attached to 
the Px base, and the modified Px sub- 
strates are also incorporated into DNA 
by PCR [27, 28]. The modification af- 
fects the selectivity of the unnatural base 
pairing. Some modifications reduce the 
misincorporation rate of the unnatural 
base substrates opposite the natural bases 
in the template, without decreasing the un- 
natural base pair selectivity. For example, 
the unnatural base pair between Ds and 
diol-modified Px exhibits high pairing se- 
lectivity (99.77-99.92%/replication) and 
an extremely low misincorporation rate 
opposite the natural bases (0.005%/repli- 
cation/base), which is comparable to the 
rate of mispairing among the natural 
bases (Table 1) [28]. Thus, the unnatural 
base pairs can be retained at their initial 
positions over many PCR cycles. The fluo- 
rescent Dss base is also used as the pairing 
partner of Px in replication. 


2.1.3 5SICS-MMO2 and 5SICS-NaM 
Pairs 

The hydrophobic 5SICS-MMO2 and 5SICS-NaM pairs (5SICS:
2-((2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)tetrahydrofuran-2-yl)-6-methylisoquinoline-1(2H)-thione,
MMO2:
(2R,3S,5R)-2-(hydroxymethyl)-5-(2-methoxy-4-methylphenyl)tetrahydrofuran-3-ol,
and NaM:
(2R,3S,5R)-2-(hydroxymethyl)-5-(3-methoxynaphthalen-2-yl)tetrahydrofuran-3-ol)
were created by screening a chemical library of nucleotides and optimization
for replication (Fig. 3) [29, 30]. The high
hydrophobicity of each unnatural base 
efficiently prevents the mispairings with 
the natural bases in replication. The in- 
corporation mechanism of the unnatural 
base pairs was investigated using NMR 
and X-ray crystallographic analyses of the 
duplex and the ternary complex with a 
polymerase [32, 33]. Unlike the natural 
base-pair structures, the unnatural base 
pairs adopt an intercalated structure in 
duplex DNA. However, in the closed 
complex with KlenTaq polymerase, the 
unnatural base in the substrate is coplanar 
with its complementary pairing partner 


in the template, similar to the natural 
base-pair structures. 

DNA fragments containing the 
5SICS-NaM pair were amplified 10¹²-fold 
with high selectivity (more than 99.9%) 
by using OneTaq, a mixture of Taq 
and Deep Vent (exo+) DNA polymerases 
(Table 1) [29]. Although slight sequence 
biases were observed (less effective 
amplification when using sequences 
flanked with G-C pairs), extensive 
studies using random sequence libraries 
indicated that the biases are the same 
as those observed in the natural base 
sequence. The misincorporation rate 
of these unnatural base substrates 
opposite the natural bases in templates 
is 0.01-0.1%/replication/base (Table 1). 
The 5SICS-MMO2 pair is less effective, 
relative to the 5SICS—NaM pair, but both 
unnatural base pairs also function in T7 
transcription [34, 35]. The MMO2 base 
can be functionalized through a propyne 
linker, and a biotinylated dMMO2TP was 
synthesized for PCR. 


2.1.4 P-Z Pair 

The nonstandard hydrogen-bonding, 
unnatural base pair between 
2-aminoimidazo[1,2-a]-1,3,5-triazin-4(8H)-one (P) and 
6-amino-5-nitro-2(1H)-pyridone (Z) was 
developed by Benner’s group (Fig. 3) 
[31, 36-38]. Their initial base pairs, such 
as isoG—isoC, have several shortcom- 
ings — for instance, the tautomerism of the 
isoG base, the chemical instability of the 
isoC nucleoside, and the low accessibility 
of the isoC base to polymerases. To 
address these problems, Benner et al. 
created the P-Z pair. The P base has 
no significant tautomerism and adopts 
the nonstandard donor-donor-acceptor 
hydrogen-bonding pattern, selectively 
pairing with the Z base with an 
acceptor-acceptor-donor pattern. In 
the Z base, the nitro group, which is 
a strong electron-withdrawing group, 
stabilizes the nucleoside derivatives 
against epimerization [38]. The structure 
of the minor groove side of the P—Z pair 
is very similar to those of the natural 
base pairs, and thus these unnatural 
bases interact very efficiently with 
polymerases. 

The selectivity of the P-Z pairing and 
the misincorporation rate of the unnatu- 
ral base substrates opposite the natural 
bases, under optimized conditions us- 
ing Taq DNA polymerase, are 99.8 and 
0.2% per replication, respectively (Table 1). 
To achieve this fidelity, the concentration 
of each substrate was precisely adjusted: 
0.1 mM dATP, dGTP, and TTP, 0.6 mM 
dCTP and dPTP, and 0.05 mM dZTP [31]. 
These unnatural base substrates can be in- 
corporated opposite up to four consecutive 
unnatural base-pairing partners. 


2.1.5 S—Cu-S Pair 

Recently, Carell’s group created novel un- 
natural self-base pairs between salicylic 
aldehydes (S), through an inorganic metal 
(Cu) cross-interaction and the formation 
of a reversible ethylene diamine bridge 
(Fig. 3) [39]. X-ray crystallographic 
analysis confirmed that the C1'-C1' distance 
of the S-Cu-S pair (11.4 Å) in the 
DNA duplex is very close to that of the A-T 
pair (10.4 Å). In a polymerase reaction using 
Bst Pol I, the addition of Cu²⁺ and 
ethylene diamine significantly reduced the 
misincorporations of the natural base sub- 
strates opposite the unnatural base, es- 
pecially A-incorporation opposite S. The 
S-Cu-S pair in DNA duplexes is ther- 
mally disassembled, and thus it can be 
used for PCR. PCR amplification was performed 
with KOD (Thermococcus kodakaraensis) 
XL DNA polymerase using 0.1 mM of 
each dNTP, including dSTP, 10 mM 
ethylene diamine, and 0.75 mM CuSO₄, 
but the elongation time had to be extended 
to 15 min. After 40 cycles of PCR, 
the unnatural bases were identified in the 
amplified DNA by a nucleoside composi- 
tion analysis. Although the exact selectivity 
and amplification efficiency have not been 
reported, future applications using the 
unique S—Cu-S pair are expected. 


2.2 
Unidirectionally Complementary Base Pairs 


2.2.1 isoG—isoC Pair 

The isoG—isoC pair can be used unidirec- 
tionally for the incorporation of the isoG 
substrate opposite isoC in templates, in 
replication, and transcription [6]. Due to 
the several shortcomings described in the 
previous section, the isoG—isoC pair can- 
not be used in complementary replication, 
except under specific conditions [40], in 
which the isoG—isoC pair is frequently re- 
placed with the A-T pair. This is because 
of the mispairing between the enol form 
of isoG and T. In addition, because of 
the lack of the 2-keto group in isoC, the 
incorporation efficiency of the isoC sub- 
strate opposite isoG is lower than that of 
the T or U substrate opposite the enol 
form of isoG. Therefore, isoC cannot be 
used as the substrate opposite isoG in 
templates. However, the A incorporation 
opposite T or U is much more efficient 
than the isoG incorporation opposite T, 
and thus the combination of the isoG 
substrate and isoC-containing templates 
can be used in replication and transcrip- 
tion. Furthermore, the amino group of the 
isoG base can be modified with several 
functional groups, and the modified isoG 
substrates are also incorporated into DNA 
and RNA opposite isoC in replication and 
transcription [41, 42]. 
Fig. 4 Structures of unnatural base pairs that function in T7 transcription. 


2.2.2 s—y Pair 

The hydrogen-bonded unnatural base 
pair between 2-amino-6-(2-thienyl)purine 
(s) and pyridine-2-one (y) is useful for 
the site-specific incorporation of modi- 
fied y substrates into RNA, opposite s 
in templates, by T7 RNA polymerase 
(Fig. 4) [43, 44]. The s base was de- 
signed by attaching a thienyl group 
to position 6 of 2-aminopurine. Al- 
though the hydrogen-bonding pattern of 
2-aminopurine also fits that of T, the 
thienyl group of s sterically and electro- 
statically clashes with the 2-keto group of 
T, preventing the s-T mispairing. Instead 
of the 2-keto group of T, the hydrogen of y 
improves the shape-complementarity with 
s. Although the y substrate is incorporated 
opposite A to some extent, the efficiency 
of y incorporation opposite A is lower than 
that of T incorporation opposite A. Thus, 
the y substrate is specifically incorporated 
only opposite s in T7 transcription. In ad- 
dition, the y base can be modified with 
functional groups via a propynyl linker, 
and these modified y substrates have been 
used in transcription [18]. 


2.2.3 s-Pa Pair 

For s incorporation into RNA, the s—Pa 
pair is superior to the s—y pair (Fig. 4) [45]. 
Due to the high shape-complementarity 
between A and y, with one hydrogen- 
bonding interaction, slight misincor- 
poration of the A substrate occurs 


opposite y in the template. In contrast, 
the five-membered ring of Pa efficiently 
reduces the shape-complementarity with 
A, and instead it further improves the 
shape-complementarity with s. Thus, the s 
substrate is site-specifically and efficiently 
incorporated into RNA, opposite Pa 
in templates, by T7 transcription. The 
nucleoside of s emits fluorescence 
centered at 430 nm, characterized by 
two major excitation maxima (299 
and 352 nm), and its fluorescence is 
significantly stronger than that of the 
well-known 2-aminopurine [45-47]. 


3 
Applications 


3.1 
PCR Amplification and qPCR 


The isoG—isoC and Ds—Px pairs have 
been utilized in real-time PCR to de- 
tect and quantify target nucleic acid se- 
quences in a sample. Among them, the 
Plexor system, with the isoG—isoC pair, 
is now in practical use in multiplex qPCR 
for diagnostics (Fig. 5) [42, 48-53]. This 
system uses a quencher-linked isoG sub- 
strate and a PCR primer, in which an 
isoC nucleotide and a fluorophore are 
present at the primer 5’-terminus. The in- 
corporation of the quencher-linked isoG 
opposite isoC in the primer quenches 
Fig. 5 Real-time PCR systems using unnatural base pairs. 


the fluorescence. By attaching several 
fluorophores in unique primers targeting 
each sequence, the Plexor system can be 
used for multiplex analyses. 

In the Ds-Px pair system, the 
2-nitropyrrole moiety of the Px base acts 
as a quencher, and thus the combination 
of the quencher Px base with a modified 
Ds, the fluorescent Dss base, can be 
used for simple qPCR methods [54]. In 
real-time PCR using a Dss-containing 
primer and dPxTP, the fluorescence is 
quenched by Px incorporation. Another 
PCR system utilizes the combination of 
a fluorophore-linked Px base with Ds. 
Fluorescence of the fluorophore-linked 
Px substrates was found to be partially 
quenched, and after their incorporation 
into DNA, the fluorescence was recovered 
[55]. In the substrates, the fluorophore 
moiety stacks or collides with the Px 
base, resulting in quenching. However, 
in duplex DNAs, the fluorophore moiety 


protrudes outside the duplexes, increasing 
the fluorescence. Thus, this phenomenon 
can be applied to real-time qPCR by 
using a Ds-containing primer and a 
fluorophore-linked Px substrate (Fig. 5) 
[55]. Since the Ds—Px pair system is used 
for the complementary incorporation of 
each unnatural base, Ds-containing DNA 
fragments can also be detected as the 
target DNA sequences by these real-time 
PCR methods, which are used for 
DNA authentication and steganography 
applications. 

Recently, the Ds—Px pair system was 
applied to in vitro selection (systematic 
evolution of ligands by exponential en- 
richment, SELEX), to generate new DNA 
aptamers that specifically and tightly bind 
to a target protein [56]. A SELEX method 
was developed using a random sequence 
DNA library containing five different 
bases: A, G, C, T, and Ds. The gen- 
erated DNA aptamers containing a few 
Ds bases exhibited substantially higher 
affinities and specificities, relative to the 
aptamers comprised of only the natural 
bases. 


3.2 
Transcription 


The modified isoG, y, Pa, MMO2, and 
5SICS bases are site-specifically incorpo- 
rated into RNA, opposite their pairing 
partners in templates, by T7 transcription. 
Several functional groups, such as iodo, 
aminoalkyl, ethynyl, biotin, fluorophores, 
and quenchers, can be attached to these 
unnatural bases [18, 23-25, 35, 41, 57-60]. 
Among them, the ethynyl modification of 
the Pa base provides an efficient label- 
ing method for large RNA molecules, by 
combining the post-transcriptional mod- 
ification with functional azide reagents 
of interest through click chemistry [25]. 
This post-transcriptional method is partic- 
ularly advantageous for labeling with large 
functional groups, because the direct in- 
corporation of modified unnatural base 
substrates with large functional groups 
reduces the transcription efficiency and 
selectivity. The transcriptional incorpora- 
tion of modified Pa bases can be applied to 
the labeling of large RNA molecules, with 
lengths over 200 nucleotides [61]. DNA 
templates containing Ds are prepared and 
amplified by a fusion PCR method. 

The fluorescent s and Dss bases can 
also be incorporated into RNA, opposite 
Pa in templates, by T7 transcription [26]. 
The fluorescence intensity of the s base 
varies depending on the stacking features 
between s and its neighboring bases, 
and thus site-specific s labeling is 
useful for analyzing the local 
structural features and thermal stabilities 
of functional RNA molecules [45, 47]. 


3.3 
Translation 


Unnatural base-pair systems could expand 
the genetic codon table. By adding an 
extra base pair, the number of codons 
(three-base sequences) theoretically 
increases to 216 (6³), relative to the 
64 combinations (4³) conferred by the 
natural base pairs. In the expanded 
system, nonstandard amino acids can be 
incorporated site-specifically into proteins 
by translation. To date, translation 
systems using the isoG—isoC and s—y 
pairs have been reported. 
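
The codon arithmetic can be checked by simple enumeration. The sketch below is only an illustration: the letters `X` and `Y` are placeholder symbols for an arbitrary unnatural base pair and do not correspond to any particular pair discussed in this article.

```python
from itertools import product

NATURAL_BASES = "ACGU"
# X and Y are hypothetical placeholders for one extra (unnatural) base pair.
EXPANDED_BASES = NATURAL_BASES + "XY"


def codons(alphabet: str) -> list[str]:
    """Enumerate every three-base sequence over the given alphabet."""
    return ["".join(triplet) for triplet in product(alphabet, repeat=3)]


natural_codons = codons(NATURAL_BASES)
expanded_codons = codons(EXPANDED_BASES)
# Codons that contain at least one unnatural base are the newly available ones.
new_codons = [c for c in expanded_codons if set(c) & set("XY")]

if __name__ == "__main__":
    print(len(natural_codons))   # 4**3
    print(len(expanded_codons))  # 6**3
    print(len(new_codons))       # 6**3 - 4**3
```

With six letters there are 216 codons in total, of which 152 contain at least one unnatural base and are therefore, in principle, available for assignment to nonstandard amino acids.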

Benner’s group successfully employed 
in-vitro translation to produce a 16-amino 
acid peptide containing 3-iodotyrosine, 
using a synthesized mRNA containing 
an isoCAG codon and 3-iodotyrosyl-tRNA 
containing a CUisoG anticodon [5]. An 
in-vitro translation system coupled with 
T7 transcription by the s-y pair was 
also reported [44]. In the system with 
a yAT codon and a CUs anticodon in- 
teraction, 3-chlorotyrosine was introduced 
site-specifically at position 32 in the hu- 
man Ras protein (185 amino acids), us- 
ing mRNA (747-mer) containing the yAT 
codon prepared by T7 transcription of a 
DNA template containing s (Fig. 6). In 
the report, the 3-chlorotyrosyl-tRNA con- 
taining a CUs anticodon was prepared 
by the ligation of chemically synthesized 
RNA fragments. By using the s—Pa pair, 
the tRNA molecule containing s can also 
be prepared by T7 transcription, using 
Pa-containing DNA templates. 


4 
Conclusion 


The unnatural base pair systems that func- 
tion as a third base pair in the genetic 
information flow have been summarized. 
By creating unnatural base pairs, the 
present recombinant DNA technology that 
is restricted to the two natural base pairs 
could be advanced to extra-component 
integrated genetic engineering. Most 
of these unnatural base pairs have only 
recently been developed, so only a few 
applications have been reported; new methods 
and novel products are expected to be developed 
with unnatural base pair systems in the 
future. However, there have as yet not been 
any reports of unnatural base pairs that 
function in the entire central dogma, 
from replication to translation, and this 
is one of the major goals of the next 
few years. Eventually, the genetic alpha- 
bet expansion could be applied to artificial 
living cells and organisms, as a research 
tool for synthetic biology, to track tar- 
get gene expression, as a biosafety tool 
for recombinant organism containment, 


and as an efficient production system for 
artificial proteins with nonstandard amino 
acids. 
Synthetic biology 


In contrast to analytical biology, which examines how natural systems work, synthetic 
biology uses biological components to engineer living systems with properties that are 
designed rather than evolved. 


Synthetic morphology 


A subfield of synthetic biology that aims to engineer anatomical structures. 


Bioengineering 


The engineering of, or with, living cells, tissues, or organisms. 


Feedback 


Use of the result of a process to control how the process operates; for example, use of 
the temperature of a room to control the power of a room heater. 


Stem cell-based approaches to regenerative medicine are promising, but face 
several serious unsolved problems before their clinical use becomes possible. These 
include the reliable production and expansion of stem cells, reliable differentiation, 
avoidance of tumorigenicity, efficient integration into the host, population control 
within the host, and monitoring of cell behavior once integrated. Synthetic 
biology — the engineering of new, designed functions into cells — offers potential 
solutions to each of these problems (albeit with possibly additional issues of 
regulation and public acceptance). This potential of the technique is illustrated in 
this chapter by discussing how existing synthetic systems could be adapted for stem 
cell use. 


1 
Introduction: Current Problems in 
Regenerative Medicine 


In this chapter, the ways in which syn- 
thetic biological techniques might be used 
to solve problems that are currently hin- 
dering the development of reliable and 
effective regenerative therapies will be dis- 
cussed. The intention is not to provide an 
exhaustive list of either problems or so- 
lutions; rather, a small number of very 
important problems has been selected 
to illustrate how synthetic biological ap- 
proaches might be of assistance. 


Regenerative medicine is a broad 
field that includes a range of therapies 
from molecules to cells, to engineered 
matrices and decellularized tissues [1, 2]. 
A large proportion of current research 
and development work, in both academic 
and industrial sectors, focuses 
on cell-based therapies—in particular, 
the production, differentiation, and 
practical use of human stem cells. 
Hence, attention will also be focused on 
cell-based therapies, which have already 
shown promising results in animal 
and human trials in the treatment of 
conditions including sickle cell anemia 
[3], Parkinson’s disease [4], acute renal 
failure [5], and ischemic stroke [6]. 

Cell-based therapies usually employ 
stem cells of various types. The first to 
be used were tissue (‘adult’) stem cells, 
with bone marrow-derived stem cells be- 
ing used even prior to 1970 [7]. Tissue stem 
cells are assumed to be multipotent (not 
totipotent), being restricted in the range of 
cell types into which they are able to differ- 
entiate. They are also limited in the extent 
to which they can propagate in culture [8]. 
These limitations, as well as difficulties in 
sourcing sufficient quantities of such cells, 
have so far restricted their usefulness in 
therapeutic settings [8, 9]. Embryonic stem 
cells (ESCs), derived from the blastocyst, 
can propagate indefinitely in culture and 
retain the ability to differentiate into any 
cell type, giving them obvious potential for 
regenerative medicine [9-13]. The fact that 
ESCs can be derived only from an embryo 
is, however, a problem both technically 
and ethically, and this has motivated the 
search for alternatives [8, 9, 14]. 

Early studies of amphibian and mam- 
malian cloning introduced the possibility 
of returning a nucleus from a fully differ- 
entiated somatic cell back to a pluripotent 
state, by exposing it to cytoplasmic factors 
characteristic of pluripotent cells [15]. 
This inspired the search for the set of 
cytoplasmic/nucleoplasmic proteins that 
would mediate this return to pluripoten- 
tiality. In 2006, Takahashi and Yamanaka 
showed that the expression of four spe- 
cific transcription factors was sufficient 
to make differentiated cells pluripotent 
again — albeit at low efficiency — in rodents 
[16], and again in 2007 in human cells [17]. 
Such pluripotent cells are termed induced 
pluripotent stem (iPS) cells [16], and of- 
fer the advantages of ESCs in potential 
future therapies, while avoiding both the 


problem of ethical objections and, if autol- 
ogous cells are used, the risk of immune 
rejection [13, 18]. 

Whether the stem cells used are ESCs 
or iPS cells, there are significant obstacles 
to their clinical use [9, 13, 19]. Some, such 
as the need to guarantee freedom from 
pathogens and pyrogens, are common to 
all medicines and are conceptually trivial 
even if difficult to ensure in practice: they 
will not therefore be discussed here. Other 
problems are unique to stem cells. These 
include: 


• A need for the reliable production and expansion of cells. 
• A need for the control of potency and differentiation state. 
• A need to avoid the risk of tumorigenesis in the host. 
• A need to target the cells efficiently in the host. 
• A need to ensure that the graft grows to the appropriate size for the host. 
• The requirement to monitor what is happening in the host (this is not 
  essential, but is very helpful if it can be arranged). 


The potential application of synthetic 
biology to each of these problems will be 
discussed after a brief overview of the field 
of synthetic biology. 


2 
Synthetic Biology as a Promising Source of 
Solutions 


The term “synthetic biology” has two widely 
used meanings: (i) the artificial building 
of living systems (“the creation of life” 
from chemical components); and (ii) the 
building of artificial living systems (cells, 
tissues, and organisms that have been 
engineered for functions they had not naturally 
evolved to perform). It is this second 
sense — the engineering of new functions 
into living cells — that has the most imme- 
diate potential for regenerative medicine. 
In general, synthetic biologists create new 
cellular functions by engineering novel ge- 
netic networks into the genome of the cell, 
often exploiting a rich repertoire of genes 
and gene-fragments from other organisms 
[20]. Typically, these synthetic networks 
involve information processing and “com- 
putation” as well as metabolic functions, 
allowing them to be responsive to the en- 
vironment of the cell that bears them. 
Before the way in which synthetic biol- 
ogy can be applied to solving problems in 
stem cell biology is discussed, a descrip- 
tion will be provided of a simple synthetic 
network that changes state according to its 
environment, in order to illustrate how a 
typical artificial gene system functions in 
a mammalian cell. 

This synthetic network is an analog of 
evolved systems that activate gene expres- 
sion when the amount of an external signal 
(e.g., a morphogen in a concentration gra- 
dient) is between two threshold values: the 
gene is OFF when the signal (the concen- 
tration of morphogen) is either above the 
high threshold or below the low threshold 
[21]. The authors call this a “band-pass” 


[Fig. 1a: schematic of the switch; legible labels: External signal, Right arm, Output] 

filter although, since the signal is being 
interpreted rather than simply passed on, 
this nomenclature is not strictly appropri- 
ate. In this system (Fig. 1a), the output 
gene is coupled to a promoter that can be 
repressed by either of two transcriptional 
repressors. One of these, on the left side 
of the figure, is made by a gene under 
the control of a transcriptional activa- 
tor, the expression of which is suppressed 
by the external signal. When amounts of 
this external signal are low, the activator is 
made, so a repressor of the final output is 
made, and the output is held OFF. When 
the amount of external signal is high, no 
repressor is made and this left arm of the 
network does not repress transcription of 
the output gene. The other transcriptional 
repressor, in the right arm of the pathway, 
is itself controlled by a repressor that is 
transcribed in response to the activator. 
This arm therefore makes no repressor of 
the final output when there is no exter- 
nal signal, but when the external signal is 
high enough, the repressor is made and 
the output is shut OFF. When parameters 
are chosen correctly, there is a window of 
concentration at which neither arm of the 
network represses the output, and it is free 
to go ON (Fig. 1b). 
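
The logic of this network can be sketched in a few lines of code. The sketch below is a deliberately simplified, threshold-based caricature and not the model used in [21]: the Hill-type activator curve, the parameter values, and the names `activator_level`, `theta_left`, and `theta_right` are all assumptions chosen for illustration. It assumes that the left arm needs more activator to switch on than the right arm does, which is one way of obtaining the window described above.

```python
def activator_level(signal: float, k: float = 1.0, n: float = 2.0) -> float:
    """Activator abundance (0..1); its expression is suppressed by the signal.

    The Hill form and the values of k and n are illustrative assumptions.
    """
    return 1.0 / (1.0 + (signal / k) ** n)


def output_on(signal: float,
              theta_left: float = 0.8,
              theta_right: float = 0.2) -> bool:
    """Return True when neither arm represses the output gene."""
    activator = activator_level(signal)

    # Left arm: the activator drives a repressor of the output.
    left_repressor = activator > theta_left

    # Right arm: the activator drives an intermediate repressor, which in
    # turn shuts off the repressor of the output.
    intermediate_repressor = activator > theta_right
    right_repressor = not intermediate_repressor

    return not (left_repressor or right_repressor)


if __name__ == "__main__":
    for s in (0.1, 0.5, 1.0, 2.0, 10.0):
        print(f"signal={s:5.1f}  output={'ON' if output_on(s) else 'OFF'}")
```

With these hypothetical parameters the output is ON only at intermediate signal levels; moving `theta_left` and `theta_right` closer together narrows the window, which corresponds to the remark that the parameters must be chosen correctly.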

Although the simple example network 
described above used just one control 


[Fig. 1b: plot of output against concentration of external signal, with three regions: repressed by left arm, not repressed, repressed by right arm] 

Fig. 1 Operation of a synthetic switch by a concentration 
gradient [21]. (a) Design of the switch; (b) The output gene 
is transcribed only in a limited window of morphogen 
concentration (the external signal). 


Synthetic Biology Approaches for Regenerative Medicine 


system (transcriptional, modulating the 
expression of genes), many types of control 
are available. Originally, most synthetic 
logic gates functioning in mammalian 
cells relied on the activation or repres- 
sion of transgene promoters through the 
response of transcription factors to small 
molecules in the environment of the cell. 
Intricate switches have been developed 
in which expression of transgenes can 
be switched on or off upon the addi- 
tion of an antibiotic or a combination of 
antibiotics [22]. Recently, such synthetic 
circuits have been engineered to respond 
to small (diffusible, only weakly immuno- 
genic) molecules such as the apple-derived 
metabolite phloretin [23], vitamin H [24], 
and vanillic acid [25]. 

Engineered proteins such as zinc- 
fingers or transcription activator-like 
effectors (TALEs) can be used in synthetic 
biology as robust, specific transcriptional 
activators or repressors. Synthetic zinc- 
finger repressors have been used to reduce 
expression of the mutant gene responsible 
for Huntington’s disease in the brains of 
mice [26], and zinc-finger DNA-binding 
domains have been fused to a core clock 
protein to regulate expression of a target 
gene in an oscillating circadian rhythm 
[27]. TALEs have also been orthogonally 
engineered to activate or repress synthetic 
transgenes with very high specificity, 
without affecting endogenous genes [28]. 

Transcriptional regulation can also be 
achieved using epigenetic switches, as was 
demonstrated by Kramer et al. [29] with an 
engineered epigenetic toggle switch. The 
successive administration of two antibi- 
otics stably switched the expression of a 
human glycoprotein in Chinese hamster 
ovary cells between two different states, 
functioning as expected even after implan- 
tation of the engineered cells into mice. 


Translational control offers the poten- 
tial for high-speed modulation of pro- 
tein synthesis. RNA-based regulators have 
been developed for the rapid modulation 
of transgene expression [30], as protein 
translation and post-translational modi- 
fications are the only rate-limiting steps 
left. In 2004, Yen et al. used RNA 
self-cleavage to regulate transgene expres- 
sion in mammalian cells, by incorporating 
an inducer-responsive ribozyme in the 
mRNA sequence of the transgene [31]. In 
the absence of an inducer, the ribozyme 
cleaved itself, targeting the resulting trun- 
cated mRNA for degradation. If the ap- 
tamer domain of the riboswitch bound its 
inducer, it underwent a conformational 
change that inhibited the ribozyme’s activ- 
ity and enabled the mRNA to be translated 
to protein. More recently, Chen et al. used 
the same design to control T lymphocyte 
proliferation in mice [32]. This riboswitch 
regulated expression of a cytokine im- 
portant for T lymphocyte survival; in the 
absence of the inducer, without expression 
of the cytokine, the T lymphocytes died. 

Deans et al. designed a genetic switch 
that combined both transcriptional and 
translational levels of control [33] (Fig. 2). 
When the switch was not induced, the 
constitutively expressed repressor LacI 
blocked the expression of the transgene 
and of a second repressor, the Tet repressor 
(TetR), thus permitting expression of a 
small hairpin RNA (shRNA) targeted to 
the transgene mRNA (in this way, if 
the switch were leaky at the transcriptional 
level, the shRNA would block 
translation of any transgene mRNA produced). 
When the switch was induced 
by isopropylthio-β-galactoside (IPTG), the 
transgene was expressed as well as TetR, 
which in turn repressed expression of the 
shRNA, and the transgene mRNA could 
be translated. 
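
The two layers of control in this switch can be written out as Boolean logic. The following is a minimal sketch under the assumption that every component is simply present or absent; the function name `dual_layer_switch` and the `leaky_transcription` flag are hypothetical and serve only to show why the shRNA layer suppresses leakage.

```python
def dual_layer_switch(iptg: bool, leaky_transcription: bool = False) -> dict:
    """Boolean caricature of a switch with transcriptional and RNAi control.

    All components are treated as ON/OFF, which is an illustrative
    simplification of the graded behavior of the real system.
    """
    # IPTG binds LacI and relieves its repression.
    laci_active = not iptg

    # LacI represses both the transgene and the TetR gene.
    tetr = not laci_active
    transgene_mrna = (not laci_active) or leaky_transcription

    # TetR represses the shRNA that targets the transgene mRNA.
    shrna = not tetr

    # Protein is made only if mRNA exists and is not silenced by the shRNA.
    protein = transgene_mrna and not shrna

    return {
        "TetR": tetr,
        "shRNA": shrna,
        "transgene_mRNA": transgene_mrna,
        "protein": protein,
    }


if __name__ == "__main__":
    print(dual_layer_switch(iptg=False))
    print(dual_layer_switch(iptg=False, leaky_transcription=True))
    print(dual_layer_switch(iptg=True))
```

In the uninduced state the sketch yields no protein even when leaky transcription is switched on, because the shRNA is present; in the induced state TetR removes the shRNA and the transgene mRNA is translated.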
Fig. 2 A synthetic mammalian genetic switch 
that combines transcriptional and translational 
repression for tight, tunable, and reversible 
control [33]. (a) Switch OFF: expression of 
the reporter is repressed by the LacI repressor 
that also represses expression of the Tet repressor. 
Any leaky transcription of the reporter 
is repressed at the translational level by RNA 
interference; (b) Switch ON: upon addition of 
IPTG, which binds the LacI repressor, both 
repressions are relieved and the reporter is expressed. 
TetR is also expressed and represses 
transcription of short hairpin RNA (shRNA). 


Today, faster, robust, and highly 
spatially regulated post-translational 
regulatory systems are emerging, in 
which membrane receptors activate 
signal transduction cascades to generate 
rapid responses in the target cell [34]. In 
an interesting example, Levskaya et al. 
demonstrated a reversible light-switchable 
control of protein-protein interaction 
in mammalian cells based on the plant-derived 
protein Phytochrome B [35]. These 
authors adapted a signaling pathway 
from Arabidopsis thaliana in which, 
upon light stimulation, Phytochrome 
B binds a downstream transcription 
factor and regulates the transcription of 
response genes. As this system allowed 
a rapid and precise translocation of 
target proteins, the group was able to 
change the morphology of mammalian 
cells by directing upstream activators of 
Rho-family GTPases, which control the 
actin cytoskeleton, to the cell membrane. 


The mammalian synthetic regulation 
tools described above allow for the con- 
struction of complex synthetic networks 
with tight control over expression of trans- 
genes in engineered stem cells both in 
vitro and in vivo. However, it is worth 
noting that while mammalian synthetic 
biology systems have been constructed 
and have been shown to function well 
[36-40], the cycles of design and construc- 
tion tend to be slower than is the case for 
the more numerous prokaryotic synthetic 
systems. Eukaryotes have evolved intricate 
ways of regulating gene expression, and 
their highly compartmentalized structure 
adds an even greater complexity to the en- 
gineering of synthetic cellular pathways. 
There is also a high risk, while working 
with components sourced from natural 
organisms, that there may be cross-talk 
between the synthetic and endogenous 
systems [41]. 
2.1 
Production and Expansion of Stem Cells 


The sourcing or generation of stem cells 
can be challenging, and the subsequent 
process of expanding such cells outside 
a living host presents its own problems 
[8]. Since Takahashi and Yamanaka’s 2006 
breakthrough in generating iPS cells by 
the viral transduction of four key transcrip- 
tion factors (KLF4, c-MYC, OCT4, SOX2) 
[16], studies have been ongoing to over- 
come the challenges associated with the 
induction and differentiation of pluripotent 
stem cells. The risk of tumorigenesis, 
which was inherent in the original approach 
to generating iPS cells, demanded 
that safer and more reproducible methods 
be developed before clinical 
applications could be considered [13, 42]. 
Diverse techniques are now available for 
generation of iPS cells without the risk 
of insertional mutagenesis or viral reacti- 
vation [13]. However, this represents only 
one aspect of the problem. 

In vivo, stem cells exist in a niche in 
which other cells provide them with physi- 
cal protection, survival signals and also, 
sometimes, signals that drive prolifera- 
tion, differentiation, or both. During stem 
cell expansion in vitro, the culture con- 
ditions and manipulations are critical to 
ensure that stem cells are maintained ap- 
propriately and that oncogenic mutations 
are not inadvertently selected [8, 9, 43, 44]. 
Thus, determined efforts have been made 
during recent years to develop feeder-free 
methods of culturing stem cells, partly to 
reduce variability in populations of cul- 
tured cells [45], and partly to reduce the 
risk of contamination. However, it could 
be argued that a feeder layer of engineered 
cells, modified to mimic an appropriate 


niche, could help both to maintain ho- 
mogeneous cultures of stem cells, and to 
prevent differentiation until it is required. 

In vivo, communication between stem 
cells and their niche is two-way, with 
signals being sent to stem cells depend- 
ing in part on the signals they them- 
selves send out. The construction of 
engineered systems to provide appropriate 
stem cell-expanding signals will require 
the connection of normal or artificial re- 
ceptors (e.g., Fab-enzyme fusions) to a 
synthetic information-processing network 
to control the production of the required 
signaling molecule. Although this has 
not yet been achieved, simple synthetic 
mammalian signaling circuits have been 
constructed and have provided a first 
proof-of-principle. 

In one recently reported system, 
transmitter cells constitutively express 
an enzyme that converts indole in the 
surrounding medium into tryptophan 
[46]. The receiver cells respond in a 
dose-dependent manner through a 
tryptophan-dependent transactivator that 
controls the transcription of a reporter 
gene. Building on this system, the authors 
designed a complex two-way communi- 
cation network where two cell lines were 
engineered—the transceiver cells and 
processor cells (Fig. 3). The transceiver 
cells have the integrated tryptophan 
synthase module previously described, 
and convert indole into tryptophan as 
a signal for the processor cells. The 
processor cells are programmed with 
tryptophan-inducible reporter expression 
to acknowledge processing of the signal, 
but also inducibly express an enzyme that 
converts ethanol present in the medium 
into acetaldehyde. The second module 
integrated into the transceiver cells 
translates the acetaldehyde signal into 
expression of a second reporter, completing
the two-way communication system.
The system functioned as expected: the
authors were able to record an increase
in the production of the reporter linked
to the tryptophan signal, followed by an
increase in the level of the reporter linked
to the acetaldehyde signal.

Such intercellular communication systems
could be used between iPS cells
and engineered companion cells, where
the companions sense a signal output
from the iPS cells and secrete appropriate
cytokines in response. Programmed
homeostasis through quorum-sensing behavior
(of the type that will be discussed in
more detail below) could be engineered in
companion cells to maintain their population
density at the minimum necessary to
assist proliferating iPS cells in vitro.

Fig. 3 A synthetic mammalian two-way communication
system [46]. Transceiver cells constitutively
express tryptophan synthase, which
converts indole from the medium into a tryptophan
signal. In the processor cells, tryptophan
induces expression of a first reporter and of
alcohol dehydrogenase, which converts ethanol
from the medium into an acetaldehyde signal.
Acetaldehyde from the processor cells induces
expression of a second reporter in the
transceiver cells.


2.2 
Control of Differentiation/Potency 


The differentiation of stem cells in vitro is 
associated with two problems. The first 
problem is the need to drive efficient 
differentiation along the required pathway, 
and the second problem is the need to 
ensure that undifferentiated or wrongly 
differentiated cells are not present in the 
population administered to the patient, in 
order to limit the risks of tumorigenesis 
[8, 19]. 

Traditionally, differentiation has been
promoted by adding purified growth fac- 
tors (e.g., Wnt and BMP proteins and 
their inhibitors) to cultured stem cells [47]. 
Differentiation in vitro is much less or- 
ganized and predictable than in vivo [48], 
most likely because the rapid feedback 
responses from the niche to the differ- 
entiating and yet-to-differentiate cells are 
missing. One possible solution would be 
to engineer companion cells, not only to 
maintain stem cell proliferation for the 
required period, but also to activate dif- 
ferentiation on demand and to engage in 
dynamic signaling to steer the process in 
the desired direction. 

There are well-established precedents 
for using genetically modified cells in co- 
culture with stem cells: the OP9-DL1 cell 
line, which has been modified to express a 
key receptor ligand on its surface, was first 
used to direct the differentiation of stem 
cells to mature T lymphocytes over 10 
years ago [49]. With engineered cells sens- 
ing and responding to stem cells, a high 
level of precision is possible in timing the 
delivery of key signals to cells depending 
on their individual requirements, result- 
ing in a more homogeneous population 
[47]. Given the notoriously poor efficiency 
of the reprogramming process [50], a sim- 
ilar argument could be made for using 
engineered cells to manufacture and trans- 
fer the reprogramming factors directly to 
the closely associated would-be iPS cells. 

Another synthetic approach that might 
impact on the differentiation of human 
stem cells is the use of modular, novel 
proteins. Haynes and Silver used syn- 
thetic Polycomb-based artificial transcrip- 
tion factors to specifically reverse the 
epigenetic silencing of a number of devel- 
opmentally significant genes [51]. These 
authors argued that such a technique could 
be used to reverse silenced developmental
regulators required for stem cell differentiation.


2.3
Addressing the Danger of Tumorigenicity 


It has long been appreciated that the na- 
ture of stem cells makes tumor formation 
a significant risk in stem cell therapies. 
The formation of teratomas (tumors that 
consist of cell types associated with all 
three germ layers) has, after all, long 
been used as an assay of pluripotency 
[8]. The introduction of cells still in their 
pluripotent state, rather than already dif- 
ferentiated (or at least committed to do 
so), greatly increases the risk of tumori- 
genesis [8, 19]. Even where the risk from 
healthy stem cells can be controlled, there 
is an additional danger from oncogenic 
transformation during the adaptation of 
cells to culture and, in the case of iPS 
cells, through unintended consequences 
of the methods used to derive them 
[52, 53]. iPS cells have been observed to 
display chromosome variations during the 
course of culture, emphasizing the impor- 
tance of controlling the selection pressures 
that are applied to such cells in vitro [54]. 
Early studies with chimeric mice created 
with iPS cells that had been generated us- 
ing a retroviral transmission of the four 
canonical transcription factors, revealed a 
high incidence of tumors [42], but this was 
reported as being due to a reactivation of 
expression of one of the four factors (Myc). 
Insertional mutagenesis through the use 
of viral vectors is also considered a risk [3]. 

Synthetic biology may provide the 
means to prevent tumorigenesis from ad- 
ministered cells. One possible strategy 
would be to engineer stem cells with 
a synthetic “self-destruction” system, so
that they could be destroyed if they were
to present a risk [55]. Such systems could
be implemented at varying degrees of sophistication.
At their simplest, they would
consist of nothing more than inducible
proapoptotic genes (encoding, for example,
activated caspases) that could be induced
by a factor such as tetracycline, not nor- 
mally present in the body. At least in 
vitro, such inducible systems are very ef- 
fective at killing cells. An example of such 
a synthetic system has been constructed 
by Deans and colleagues [33], who inte- 
grated a controlled apoptosis module into 
mammalian cells, using the Bax proapop- 
totic protein driven by a robust two-level 
repression switch (see Fig. 2). The cells 
showed very significant levels of controlled 
cell death when the switch was ON. The 
switch could be induced by an external 
signal, such as a drug given intravenously. 
Unfortunately, the problem with such a 
simple system is that the cells which have 
performed the regenerative tasks required 
of them will also be destroyed (potentially 
killing the patient if a significant part of 
some vital organ is now composed of tissue 
derived from these cells). 

An alternative, more sophisticated ap- 
proach would couple the control of the 
synthetic proapoptotic system, via a genetic
“logic gate” (of which many have
been constructed by synthetic biologists 
[36]), with a reporter of differentiated state. 
For example, in a patient who requires 
stem cell-based regeneration of heart mus- 
cle, the tetracycline-induced expression of 
caspases could be blocked by a protein 
produced from a gene downstream of a 
cardiac myosin promoter. Cells differenti- 
ated correctly into cardiac myocytes would 
ignore the drug, while cells differentiated 
into anything else, or cells that have not 
differentiated at all, would die. Such a 
selective killing system could even be induced
prophylactically at some time after
the introduction of stem cells, as a routine
part of the clinical procedure. 
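
The gating logic proposed here can be written down as a small truth table. The following Python sketch is purely illustrative: the function name, the dictionary fields, and the toy population are hypothetical, and a real genetic gate would respond to graded promoter activity rather than to clean Boolean values.

```python
def should_die(drug_present: bool, cardiac_promoter_active: bool) -> bool:
    """Kill signal = drug AND NOT cardiac marker (hypothetical gate)."""
    return drug_present and not cardiac_promoter_active


def apply_drug(population):
    """Return the cells that survive a pulse of the inducing drug."""
    return [cell for cell in population
            if not should_die(True, cell["cardiac"])]


# Hypothetical mixed population after an incomplete differentiation run.
cells = [
    {"name": "cardiomyocyte", "cardiac": True},
    {"name": "undifferentiated", "cardiac": False},
    {"name": "wrongly differentiated", "cardiac": False},
]

survivors = [cell["name"] for cell in apply_drug(cells)]
print(survivors)
```

In this toy population only the correctly differentiated cell is spared by the drug pulse, and no cell is killed while the drug is absent, which is the behavior the proposed gate is intended to deliver.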

A working prototype of such a sophisti- 
cated system has already been constructed 
by Xie et al. [56]. These authors engineered 
a “cell-type classifier” that senses the ex- 
pression levels of a set of endogenous 
miRNAs and triggers a Bax-based apop- 
tosis of cells that match a predetermined 
profile. The authors tested the circuit on 
different lines of HeLa cells, analyzing the 
presence of six cancer-associated miRNA 
markers computed with complex Boolean 
circuitry. An improved version of this clas- 
sifier would be ideal for killing rogue stem 
cells based on a set of miRNAs specific 
to complex states such as differentiation 
or health status. Constructed the other 
way round, to kill all cells except those ex- 
pressing a given profile of miRNAs, the 
system could be used to allow only stem 
cells that had reached the desired state of 
differentiation to survive. 
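
The classifier idea, including its inverted form, can be sketched as a Boolean test over a miRNA profile. The example below is an assumption-laden illustration rather than the published circuit: the marker names, the normalized threshold, and the `invert` flag are all hypothetical placeholders.

```python
# Hypothetical marker sets; placeholder names, not the published markers.
HIGH_MARKERS = ("miR-H1", "miR-H2")            # must be abundant
LOW_MARKERS = ("miR-L1", "miR-L2", "miR-L3")   # must be scarce
THRESHOLD = 0.5                                # assumed normalized cut-off


def matches_profile(levels):
    """True if every high marker is abundant and every low marker is scarce."""
    high_ok = all(levels.get(m, 0.0) >= THRESHOLD for m in HIGH_MARKERS)
    low_ok = all(levels.get(m, 0.0) < THRESHOLD for m in LOW_MARKERS)
    return high_ok and low_ok


def triggers_apoptosis(levels, invert=False):
    """Kill matching cells, or (inverted) kill everything except them."""
    match = matches_profile(levels)
    return (not match) if invert else match


target_like = {"miR-H1": 0.9, "miR-H2": 0.8,
               "miR-L1": 0.1, "miR-L2": 0.0, "miR-L3": 0.2}
other_cell = {"miR-H1": 0.9, "miR-H2": 0.8,
              "miR-L1": 0.7, "miR-L2": 0.0, "miR-L3": 0.2}

print(triggers_apoptosis(target_like), triggers_apoptosis(other_cell))
```

With `invert=True` the same profile test protects the matching cells and eliminates all others, corresponding to the "other way round" construction described in the text.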

Another approach to tumor avoidance, 
which could be used instead of or as well as 
the differentiation-dependent survival systems
described above, is a “time-bomb”
mechanism based on counting cell cycles. 
This would induce apoptosis after a pre- 
determined number of cell divisions. A 
synthetic genetic counter that can count 
up to three induction events has been en- 
gineered by Friedland et al. in Escherichia 
coli [57]. To the present authors’ knowledge 
a mammalian version of this synthetic cir- 
cuit has not yet been created, but one could 
probably be constructed. 
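
As a minimal sketch of the "time-bomb" logic, the counter can be treated as a saturating state machine that fires once a preset number of divisions has been registered. The class and function names below are hypothetical, and the model ignores the biochemical detail of how each division would be detected.

```python
class DivisionCounter:
    """Saturating counter that triggers apoptosis at a preset limit."""

    def __init__(self, limit: int):
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        self.count = 0

    def divide(self) -> bool:
        """Register one division; return True once the limit is reached."""
        if self.count < self.limit:
            self.count += 1
        return self.count >= self.limit


def divisions_until_death(limit: int, max_cycles: int = 100):
    """Number of divisions a cell lineage completes before apoptosis."""
    counter = DivisionCounter(limit)
    for cycle in range(1, max_cycles + 1):
        if counter.divide():
            return cycle
    return None  # limit not reached within max_cycles


print(divisions_until_death(3))
```

A limit of three mirrors the counting capacity reported for the bacterial counter; a clinically useful mammalian version would presumably need to count considerably higher.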


2.4 
Targeting in the Host 


A major challenge for successful stem 
cell therapy is the implantation of a large 
number of healthy stem cells at the site 
that requires regeneration [58]. Current 
research on endogenous cell homing investigates
the process by which stem cells
naturally migrate and engraft to sites of 
injury via the circulatory system [59]. The 
mechanisms behind cell homing could in- 
spire new synthetic strategies to enhance 
the efficiency of stem cell homing in vivo. 
As illustrated above (see Fig. 3), syn- 
thetic biologists have engineered elegant 
receiver modules in mammalian cells for 
intercellular communication. These inte- 
grated circuits process external signals 
secreted by sender cells, controlling the 
response of the receiver cells. In a sim- 
ilar way, stem cells could be engineered 
to respond to signals sent by tissue need- 
ing repair. Signaling biomolecules such 
as growth factors and chemokines serve 
as chemoattractants for stem cell homing, 
and could perhaps act as signal molecules 
toward stem cells engineered to express 
the corresponding receptors. It may even 
be possible to emulate natural develop- 
ment and to engineer waypoint navigation 
[60]. For this, cells would first express re- 
ceptor systems for homing to the entry 
point of an organ and would then, when 
they have detected that they have reached 
that point, express a new receptor for accu- 
rate homing to the precise site where they 
are needed. This would be most valuable 
where no natural homing system has been 
discovered (the tissues most in need of re- 
generative medicine being precisely those 
without efficient natural regeneration). 
Even without such sophistication, the 
natural homing ability of some stem cells 
could be improved. The low efficiency of 
some therapeutic approaches is due to the 
expression of heterogeneous cell-surface 
adhesion receptors which prevent the cor- 
rect homing of stem cells in vivo [58]. 
Thus, engineering stem cells to express 
the appropriate membrane adhesion ligands
(and to knock down inappropriate
ligands) should greatly improve the homing
process.

Synthetic biology also raises the possibil- 
ity of externally guided stem cell homing. 
This would be useful where the introduced 
cells have the potential to colonize several 
sites but only one site needs to be targeted. 
There have been several demonstrations 
of light-activated gene expression [35], and 
tests in chicken embryos have suggested 
that red light can penetrate tissues fairly 
efficiently [61]. Cells carrying appropriate 
synthetic gene networks could express ad- 
hesion/homing ligands only in tissues that 
are illuminated by the clinician. Chemi- 
cally guided gene expression can also be 
used, at least for easily accessible tissues. 
Gitzinger et al. have reported a tunable 
genetic switch that is responsive to the 
apple metabolite phloretin [23]. These au- 
thors implanted engineered mammalian 
cells subcutaneously in mice, and showed 
that the cells would express a reporter in 
a dose-dependent manner when a lotion 
containing phloretin was applied locally to 
the skin. 

Another synthetic regulator device that 
might be useful for the spatiotemporal 
control of surface receptor expression is 
a miniaturized mammalian electrogenetic 
transcription circuit [62]. Weber et al. de- 
signed a device that was able to translate 
the frequency of alternating current into 
gene expression levels, by coupling an elec- 
trically triggered electrochemical reaction 
to a synthetic genetic circuit. As the minia- 
turized device is only 3 cm long (but might 
be scaled down further in the future), it 
should be possible to implant it at the 
site that needs repair in order to induce 
the expression of appropriate surface ad- 
hesion receptors, and facilitate the homing 
of engineered stem cells in response to an 
electrical signal. 

Aside from homing, poor cell survival is
another factor that can limit the therapeu- 
tic potential of stem cell therapies. One 
report has suggested that this can be im- 
proved significantly by engineering cells to 
overexpress enzymes that detoxify reactive 
oxygen species (ROS) [63]. 


2.5 
Population Control in the Host 


An assumption usually made by practi- 
tioners of regenerative medicine is that the 
stem (or stem-derived) cells that they intro- 
duce into a patient will grow to produce a 
tissue of appropriate size, and then stop. In 
cases of direct replacement of missing tis- 
sue with healthy tissue, the body’s growth 
control systems ought to ensure that this 
is indeed the case. However, in other types 
of regenerative medicine the introduced 
cells will not be identical to their normal 
counterparts, and will require an artificial 
control of their population. One example 
might be cells engineered to make insulin 
in a Type I diabetic patient where, for 
immunological reasons, it may be better 
to introduce another cell type engineered 
to make insulin according to the body’s 
requirements, rather than attempt to rein- 
troduce normal pancreatic beta-cells. 
Populations of specialized cells could 
be maintained at an appropriate size by 
homeostatic networks. Homeostasis is the 
result of a system that maintains a critical 
value within limits even when subjected 
to strong disturbances. A couple of inter- 
esting reports have described the use of 
synthetic circuits to maintain homeostasis 
of blood compounds in mice [64, 65]. 
These examples do not involve population 
control, but it should be possible to use 
similar gene networks to control a cell cycle
regulator such as p27Kip1. In this way,
the number of insulin-producing cells, for
example, could be regulated to be appropriate
to meet the demand for insulin.

Kemmer et al. designed a homeostasis 
circuit to regulate urate levels in blood, as 
uncontrolled levels can cause a number 
of pathologies including the formation 
of uric acid crystals (“stones”) in the
kidney [64]. The authors constructed a 
urate-responsive promoter that drives 
the expression of urate oxidase and is 
transcriptionally repressed in the absence 
of urate. When a certain threshold level 
of urate is reached, the cells secrete urate 
oxidase, which metabolizes urate. The 
cells were tested in mice deficient in urate 
oxidase and shown to reduce the levels 
of pathological urate crystal deposits in 
kidneys. 
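
The threshold behavior of such a circuit can be illustrated with a toy negative-feedback model. The sketch below is not taken from the cited study: it assumes a single urate pool in arbitrary units, a Hill-type promoter response, and hypothetical rate constants chosen only to make the effect visible.

```python
def simulate_urate(with_circuit, production=1.0, basal_clearance=0.1,
                   enzyme_rate=0.5, threshold=5.0, hill=4,
                   u0=0.0, dt=0.01, t_end=200.0):
    """Euler integration of a toy urate model (arbitrary units).

    du/dt = production - basal_clearance*u - enzyme_rate*E(u)*u,
    where E(u) is a Hill function standing in for promoter activity.
    All parameter values are hypothetical.
    """
    u = u0
    for _ in range(round(t_end / dt)):
        if with_circuit:
            # Promoter is largely silent below the threshold, active above it.
            enzyme = u**hill / (threshold**hill + u**hill)
        else:
            enzyme = 0.0
        du = production - basal_clearance * u - enzyme_rate * enzyme * u
        u += dt * du
    return u


uncontrolled = simulate_urate(with_circuit=False)
controlled = simulate_urate(with_circuit=True)
print(round(uncontrolled, 2), round(controlled, 2))
```

Under these assumed parameters the uncontrolled level settles where production balances basal clearance, whereas the circuit holds the level below the promoter threshold. The qualitative point is that enzyme expression rises only when it is needed.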

Ye et al. designed a blood-glucose home- 
ostasis circuit that is controlled opto- 
genetically with blue light illumination 
[65] (Fig. 4). For this, the authors com- 
bined into mammalian cells the reti- 
nal receptor melanopsin, which triggers 
an influx of calcium when activated 
by light of the appropriate wavelength, 
and a transgene under the control of a 
calcium-dependent promoter. Upon illu- 
mination with blue light, the constitutively 
expressed melanopsin triggers expression 
of the transgene (glucagon-like peptide 1, 
which induces insulin secretion in vivo). 
The engineered cells were tested in Type 
II diabetic mice in a subcutaneous im- 
plant, and were able to modulate glucose 
levels in vivo in response to external light 
stimulation. 

Fig. 4 A synthetic mammalian
circuit controlling blood-glucose
homeostasis [65]. Upon illumination
with blue light, the constitutively
expressed melanopsin triggers an
influx of calcium. The influx induces
expression of glucagon-like peptide-1
(GLP-1) via the calcium-dependent
NFAT promoter, and GLP-1 in turn
activates insulin secretion in vivo.

Population homeostasis can be maintained
through quorum-sensing (a term
borrowed from microbiology). This mechanism
is used by many microorganisms
to sense their population density and respond,
at high density, by secreting different
compounds in a coordinated manner.
Wang et al. integrated a quorum-sensing
network into mammalian cells [66] (Fig. 5)
by designing sender and receiver modules 
that combined nitric oxide (NO) synthe- 
sis and the NO signaling pathway. In this 
case, the receiver module translates NO 
levels in the medium into expression of 
a reporter, while the sender module ex- 
presses NO synthase to produce NO. The 
authors integrated both modules in the 
same cell line in a positive feedback loop 
such that, at high cell density, accumulated 
levels of NO would trigger the expression 
of the synthase, which in turn would am- 
plify the build-up of NO levels, as indicated 
by the high expression of reporter. The 
cells behaved as predicted; at low popula- 
tion densities little reporter expression was 
seen, whereas as the NO levels increased
with increasing population density the reporter
expression increased. Again,
this could, at least in principle, be coupled
to an inhibitor of cell proliferation to
keep population levels under control.

Miller et al. designed and optimized 
an artificial tissue homeostasis system to 
maintain a self-renewing population of 
stem cells [67]. For this, quorum-sensing 
and synthetic cellular heterogeneity 
modules were used to regulate stem 
cell proliferation and differentiation into
insulin-producing beta-cells. Although
these investigations were relevant for 
the culture of stem cells in vitro, the 
authors noted that such engineered cells 
might also be transplanted to maintain 
populations of insulin-producing cells in
vivo to treat Type I diabetes.

Fig. 5 A synthetic quorum-sensing circuit
in mammalian cells [66]. At high cell
density, accumulated levels of nitric
oxide trigger the expression of NO
synthase (NOS). The NO positive
feedback loop amplifies the build-up
of NO levels and drives a high
expression of the reporter.


2.6 
Reporting of Cell Fate 


Ideally, for clinical safety and patient 
care, and especially during the development
phase of a new therapy, stem cells
should report back on their health and fate, 
preferably through some medium of com- 
munication that does not require biopsy. 

Recently, Burrill et al. reported the 
design of a synthetic memory circuit that 
could be modified to report on stem cell 
exposure to stress conditions [68]. For 
this, genetic circuits were engineered in 
human cells that retained a memory of 
previous transient exposures to specific 
conditions such as hypoxia or radiation 
exposure. The device functions by linking 
a bistable transcriptional feedback loop 
(the memory circuit) to endogenous stress 
response pathways. Once triggered, the 
memory circuit activates the expression 
of a fluorescent reporter that could, in 
principle, be easily modified into a secreted 
version that would be detectable in the 
bloodstream. This would allow clinicians 
to monitor levels of output signal in the 
patient’s blood to gather clues about the 
survival of introduced stem cells. 
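
The memory behavior of a bistable positive-feedback loop can be illustrated with a one-variable toy model. The sketch below is a generic illustration, not the published circuit: the state variable `x` stands for the level of a self-activating transcription factor, and the basal rate, Hill coefficient, pulse timing, and all other numbers are hypothetical.

```python
def simulate_memory(pulse, t_end=60.0, dt=0.01):
    """Euler integration of dx/dt = basal + stimulus + feedback(x) - x.

    With these assumed parameters the loop has a low and a high stable
    state; a transient stimulus can flip it from low to high.
    """
    basal, k_half, hill = 0.05, 0.5, 4   # hypothetical parameters
    x = 0.0
    for i in range(round(t_end / dt)):
        t = i * dt
        # Transient stress signal between t = 10 and t = 20 (arbitrary units).
        stimulus = 1.0 if (pulse and 10.0 <= t < 20.0) else 0.0
        feedback = x**hill / (k_half**hill + x**hill)
        x += dt * (basal + stimulus + feedback - x)
    return x


naive = simulate_memory(pulse=False)
exposed = simulate_memory(pulse=True)
print(round(naive, 3), round(exposed, 3))
```

In this sketch the unexposed cell stays in the low state, whereas the cell that experienced the transient stimulus remains in the high state long after the stimulus has ended. That persistence is the property that would allow a reporter to record past stress exposure.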

After implantation, some engrafted 
stem cells might contribute to the re- 
generation of diseased tissue, while some 
would remain undifferentiated and might 
become tumorigenic cells (as discussed 
above). Thus, stem cells should be engineered
to report back on their differentiation
state in vivo. In a proof-of-concept study,
Moore et al. engineered stem cells to express
a fluorescent reporter when they had
successfully differentiated into cardiomyocytes
[69]. Conversely, synthetic modules
could be engineered to make stem cells
express a reporter or signal molecule un- 
til they differentiated, and/or express a 
second signal output reporting on their 
tumorigenic state. 


3 
Ethical, Legal, and Social Implications 


Ethical and legal complications are noth- 
ing new to the field of stem cells. As well 
as facing the usual regulatory issues sur- 
rounding the development and testing of 
any new treatment for use in humans, the 
field has attracted vigorous ethical debate 
on the subject of ESCs because of their em- 
bryonic source. Although iPS cells are less 
ethically contentious, some techniques for 
their production involve genetic manipu- 
lation — which does not have unqualified 
public support, particularly within the Eu- 
ropean Union. Even though many of the 
applications of synthetic biology proposed 
here are for the purposes of improved 
safety, their use is likely to disturb at least 
some people, particularly with the “playing 
God” connotations that synthetic biology 
has acquired, due largely to the overpubli- 
cized reports of some of the field’s more 
colorful practitioners. 

It may be that the first applications of 
synthetic biology to the clinical use of stem 
cells will be made in vitro only, as an 
“intelligent niche,” so that the synthetic 
systems do not enter a human host. It 
is to be hoped, though, that success here 
will bring acceptance of synthetic systems 
integration also within the patient. One 
lesson to be drawn from the debacle about 
genetically modified crops in the European 
Union is that, if a new and potentially 
frightening technology is to be accepted, 
then the public must be engaged at an early 
stage and the research conducted needs to
be transparent. 


Keywords


Cellular reprogramming: 

The conversion of one cell type of a specific lineage to another cell type of a different 
lineage. The term is often used in the context of converting differentiated somatic cells to 
the undifferentiated pluripotent embryonic stem cell-like state (i.e., induced pluripotent 
stem cells). Nevertheless, it can also refer to conversion of one differentiated cell type to 
another differentiated cell type. 


Gene circuit: 

A naturally occurring or artificially constructed functional cluster of genes that 
influence each other’s expression through inducible factors and regulatory elements 
encoded by some of the genes themselves (i.e., transcription factors interacting with 
promoter/enhancer sequences). In synthetic biology, desired biological functionality 
is often achieved through one modular component receiving an input signal (e.g., a
hormone, neurotransmitter, or metabolite), which is then relayed to another modular
component that produces an output response (e.g., production of a therapeutic peptide).


Regenerative medicine: 

To restore, repair, or replace diseased/damaged tissues and organs via a broad range of 
innovative medical therapies encompassing cell transplantation and transfusion, tissue 
engineering, biomaterial grafts, and in-situ delivery/controlled release of various growth 
factors and small-molecule drugs. 


Stem cell: 

A cell with few or no specialized functions, which is capable of giving rise to more 
unspecialized cells of the same type (self-renewal), as well as other somatic lineages 
with specialized functions through the process of differentiation. 


Synthetic biology: 

The rational and systematic design/construction of desired functionality within 
biological systems, for therapeutic, industrial, and scientific applications. This often 
involves the application of computational and engineering principles within the field of 
molecular biology to achieve an eclectic fusion of multiple disciplines. 


From its humble beginnings in bacterial prokaryotic systems and bioprocess 
engineering, the rapidly expanding field of synthetic biology is now making 
deep inroads into the field of human clinical therapy. At the same time, the 
application of stem cells in regenerative medicine has demonstrated much promise 
in human clinical trials. The convergence of these two disparate disciplines opens up 
many exciting possibilities, and offers novel solutions to various clinical challenges
currently faced in the field of stem cell and regenerative medicine, including: 
(i) nonspecific pleiotropic effects of growth factors, cytokines, and extracellular 
matrix molecules on cellular differentiation and lineage fate determination; (ii) the 
need for intercellular communication and spatiotemporal coordination within 3D 
tissue-engineered constructs; (iii) safety issues pertaining to genetic modification 
of human stem cells and the utilization of recombinant DNA and viral vectors; 
and (iv) limited plasticity and proliferative capacity of adult stem cells and the need 
for extensive in vitro expansion to attain adequate cell numbers for therapeutic 
applications. In this review, these challenges are critically examined, together with 
the relevant safety and ethical issues pertaining to the application of synthetic
biology in the field of regenerative medicine.


1 
Introduction 


Synthetic biology is the rational and sys- 
tematic design/construction of desired 
functionality within biological systems, 
made possible through rapid advances in 
molecular biology and bioinformatics over 
the past few decades [1, 2]. The application 
of computational and engineering princi- 
ples within the field of molecular biology 
achieves an eclectic fusion of multiple dis- 
ciplines in synthetic biology [1, 2]. From 
its humble origins in prokaryotic bacte- 
rial systems and bioprocess engineering, 
synthetic biology has come a long way 
and is now making deep inroads into 
human clinical therapy and disease inter- 
vention [3, 4]. During recent years, the pace 
of progress within the field of synthetic 
biology has further accelerated through 
the increasing utilization of bioinformat- 
ics and computer-aided design software 
designed specifically for synthetic biology 
applications [5]. 

One particularly promising area in 
which synthetic biology may exert a sig- 
nificant effect on the development of new 
therapeutic modalities and clinical applications
is the rapidly progressing field
of regenerative medicine. The synthetic
biology approach potentially offers novel 
solutions to several pertinent challenges 
currently faced in the field of regener- 
ative medicine. These challenges include: 
(i) nonspecific pleiotropic effects of growth 
factors, cytokines, and extracellular ma- 
trix molecules on cellular differentiation 
and lineage fate determination [6]; (ii) 
the need for intercellular communication 
and spatiotemporal coordination within 
three-dimensional (3D) tissue-engineered 
constructs [7]; (iii) safety issues pertain- 
ing to the genetic modification of human 
stem cells and the utilization of recom- 
binant DNA and viral vectors [8]; and 
(iv) limited plasticity and proliferative ca- 
pacity of adult stem cells and the need 
for extensive in-vitro expansion to attain 
adequate cell numbers for therapeutic ap- 
plications [9, 10]. In this review, a critical 
examination is provided of how new tech- 
nology platforms and molecular toolkits 
developed for synthetic biology applica- 
tions are utilized to address these perti- 
nent challenges in the field of regenerative 
medicine. 


2
Synthetic Gene Circuits for Directing Stem
Cell Differentiation and Establishing
Intercellular Communication Networks
within 3D Tissue-Engineered Constructs


Directing stem cell differentiation into 
well-defined lineages is an essential pre- 
requisite for therapeutic applications in 
regenerative medicine, particularly in the 
case of undifferentiated induced pluripo- 
tent stem cells (iPSCs) and embryonic 
stem (ES) cells, both of which poten- 
tially generate teratomas upon transplan- 
tation or transfusion in situ [11]. Be- 
cause undifferentiated stem cells lack 
tissue/organ-specific markers, their en- 
graftment and integration within the tar- 
geted tissue/organ is obviously hindered. 
Moreover, the risk also exists that trans- 
plantation/transfusion of undifferentiated 
stem cells leads to their differentiation 
into multiple divergent lineages that may 
potentially exert detrimental effects on 
tissue/organ repair and regeneration. It 
is important to note that the unfavor- 
able pathologic environment of diseased 
tissues/organs could very well be noncon- 
ducive for directing the differentiation of 
stem cells into the desired lineage. Wang 
et al. [12] demonstrated that the trans- 
plantation of undifferentiated mesenchy- 
mal stem cells (MSCs) into post-infarcted 
hearts led to the differentiation of some 
of these cells into fibroblastic scar tissues 
instead of cardiomyocytes, which in turn 
hindered rather than assisted the recovery 
of heart function. 

Traditionally and canonically, stem 
cell differentiation is carried out by 
‘trial and error’ combinations of 
varying dosages of growth factors, 
cytokines, small-molecule chemicals, 
and extracellular matrix (ECM) substrata 
within the culture milieu, as attested by
the majority of the publications within the
scientific literature [6]. However, these 
varying dosages often exert nonspecific 
multiple pleiotropic effects on stem cell 
differentiation pathways and lineage fate 
determination, thus constituting one 
of the most intractable and pressing 
challenges in the field of regenerative 
medicine. Even with the most “optimal” 
combination of growth factors, cytokines, 
small-molecule chemicals, and ECM 
substrata, only a limited proportion of the
stem cell population differentiates into
the targeted lineage, while a number of
unwanted lineages arise simultaneously
[6]. Hence, the need exists for extensive
selection, purification, and proliferation 
of the particular lineage of interest to 
attain sufficient purity and cell numbers 
for transplantation/transfusion therapy. 
The risk exists that cell dissociation for 
selection/purification protocols such as 
magnetic affinity cell sorting and flow 
cytometry might cause significant apop- 
tosis and de-differentiation attributable 
to the disruption of intercellular contacts 
and also dissolution of the ECM, which 
would be counterproductive to therapeutic 
applications in regenerative medicine. 
The advent of synthetic gene circuits 
constructed for synthetic biology applica- 
tions could offer a solution to the nonspe- 
cific pleiotropic effects of growth factors, 
cytokines, small-molecule chemicals, and 
ECM substrata utilized for “milieu-based” 
stem cell differentiation protocols. By en- 
abling the precise temporal control of 
gene expression—in particular of up- 
stream transcription factors that regulate 
global gene expression — synthetic gene 
circuits are able to exert a more spe- 
cific effect on differentiating/committing 
stem cells to a particular desired lin- 
eage. Indeed, some recent studies even 
demonstrated a direct lineage conversion 
from one differentiated phenotype to 
another through the recombinant ex- 
pression of specific genes — transcription 
factors — without the need to transition 
through the multipotent or pluripotent 
stem cell state. This phenomenon is com- 
monly referred to as “direct differentiation” 
[13]. Most notably, murine fibroblasts were 
differentiated directly into cardiomyocytes 
[14], neurons [15], and hepatocytes [16, 
17]; human fibroblasts were differenti- 
ated directly into neurons [18, 19] and 
hematopoietic progenitors [20]. 

Currently, a diverse array of synthetic 
gene circuits with different functions has 
been constructed through the application 
of engineering and computational prin- 
ciples in molecular biology. This array 
includes genetic switches [21-27], oscil- 
lators [24, 28-33], filters [34-40], commu- 
nication modules [35, 41-44], and other 
miscellaneous digital logic gates [45-47], 
the functions of which are analogous to 
similarly named electronic devices. 

Synthetic genetic switches are artifi- 
cial regulatory elements that enable cells 
to undergo a conditional transition be- 
tween gene expression states, the simplest 
example being an on/off gene expression 
in response to an appropriate stimulus. 
In 2008, the Fussenegger group [21] was 
the first to report a bistable genetic toggle 
switch constructed in mammalian cells; 
the same group had previously reported genetic 
switches with hysteresis function [22] and 
for epigenetic regulation [23]. Because cells 
naturally exhibit oscillatory variation (e.g., 
the cell cycle and the circadian clock) in 
gene expression, artificial regulatory ele- 
ments to enable oscillatory variation in 
gene expression are of tremendous interest 
in synthetic biology. Again, Fussenegger 
and colleagues [28, 29] were the first 
to report synthetic genetic oscillators con- 
structed in mammalian systems. Filters 


and band passes are another important 
category of synthetic gene circuits that are 
required for noise control and to modulate 
the output of specific signal modes. The 
Fussenegger group reported the first syn- 
thetic bandpass filter in mammalian sys- 
tems, which involved switching on target 
gene expression within a specific concen- 
tration range of biotin [40]. 
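
The bistable behavior that defines a genetic toggle switch can be illustrated with a minimal numerical sketch. The model below is not taken from the studies cited above. It assumes a generic pair of mutually repressing genes described by dimensionless Hill-type rate equations, with illustrative parameter values (`alpha`, `n`, `pulse_strength`) and a hypothetical inducer pulse that transiently accelerates the removal of one repressor.

```python
def simulate_toggle(u0, v0, pulse=None, alpha=10.0, n=2.0,
                    pulse_strength=5.0, dt=0.01, t_end=60.0):
    """Euler integration of two mutually repressing genes (dimensionless).

    u, v  : levels of the two repressors
    pulse : optional (t_start, t_stop, target), target being "u" or "v";
            while the pulse is on, the targeted repressor is removed
            faster, mimicking a transient inducer (an assumption).
    """
    u, v = u0, v0
    for step in range(int(t_end / dt)):
        t = step * dt
        extra_u = extra_v = 0.0
        if pulse is not None and pulse[0] <= t < pulse[1]:
            if pulse[2] == "u":
                extra_u = pulse_strength
            else:
                extra_v = pulse_strength
        # Each repressor is produced at a rate inhibited by the other one
        # and is lost by first-order decay (plus the pulse, if active).
        du = alpha / (1.0 + v ** n) - (1.0 + extra_u) * u
        dv = alpha / (1.0 + u ** n) - (1.0 + extra_v) * v
        u += dt * du
        v += dt * dv
    return u, v


# Without a stimulus the circuit stays in the state in which it started.
no_pulse = simulate_toggle(u0=5.0, v0=0.1)
# A temporary pulse against u (t = 10 to 20) flips the circuit, and the
# new state persists long after the pulse has ended.
with_pulse = simulate_toggle(u0=5.0, v0=0.1, pulse=(10.0, 20.0, "u"))
print("no pulse:   u=%.2f v=%.2f" % no_pulse)
print("with pulse: u=%.2f v=%.2f" % with_pulse)
```

In this sketch the same circuit ends in opposite expression states depending only on whether a temporary stimulus was applied, which is the property that distinguishes a toggle switch from a simple inducible on/off element.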
Communication modules are of special 
interest to tissue-engineering applications 
that often require complex spatiotemporal 
control of gene expression, which in turn 
influences cell-migratory and distributive 
behavior within 3D cell-scaffold con- 
structs. That the spatial distribution and 
arrangement of cells is crucial to tissue 
and organ function is well known. The 
pioneering studies of Basu et al. [41, 35], 
who constructed the first communication 
modules in prokaryotic bacterial systems, 
involved the manipulation of bacterial 
quorum-sensing systems to effect varying 
distribution patterns of bacterial cells 
in specific shapes, such as clovers and 
ellipses. In subsequent studies conducted 
by Sprinzak et al. [42, 43], the Notch-Delta 
developmental signaling pathway was 
manipulated to serve as a communication 
module between different cell populations 
to facilitate developmental patterning 
[42], and to generate mutually exclusive 
signaling states [43]. More recently, 
Fussenegger et al. [44] developed com- 
munication networks that can orchestrate 
behavior in distinct mammalian cell 
subpopulations cocultured together in 
response to cell-to-cell metabolic signals. 
This approach consisted of a sender 
subpopulation of HEK-293 cells that 
had been engineered with a synthetic 
gene circuit to constitutively express the 
β-subunit of the Escherichia coli tryptophan 
synthase enzyme that converts indole 
(within the culture milieu) to L-tryptophan 
as the signaling molecule; while the 
receiver cell subpopulation of HEK-293 
cells was engineered with an L-tryptophan 
sensor circuit that activates expression of 
the target gene in response to the presence 
of L-tryptophan within the culture milieu. 

Synthetic gene circuits may be engi- 
neered to be responsive to a diverse ar- 
ray of specific inducer molecules. These 
molecules include drugs and medications 
[48, 49], food supplements such as amino 
acids and flavorings [50-52], antibiotics 
[53-55], skin-penetrating chemicals [56, 
57], endogenous metabolites and hor- 
mones [58, 59], gaseous chemicals [60, 
61], and even physical stimuli such as 
light and electrical pulses [62, 63]. For 
safety reasons, designing synthetic gene 
circuits to be responsive to nontoxic and 
orally ingestible food supplements may be 
preferable, prominent examples of which 
include vitamin H [51], the amino acid 
arginine [50], and the strawberry flavor, 
2-phenyl ethyl butyrate [52]. However, the 
primary drawback is that vitamins and 
amino acids are found naturally in vivo 
within the human body, which may cause 
an unwanted activation of engineered syn- 
thetic gene circuits when not required. 
For certain specific therapeutic applica- 
tions, engineered synthetic gene circuits 
may be required to respond to specific 
drugs or medications being administered 
to the patient. For example, Ye et al. 
[62] engineered a synthetic gene cir- 
cuit that expressed a recombinant fusion 
protein GLP-1-Leptin for the treatment 
of metabolic syndrome in response to 
the antihypertensive drug guanabenz. For 
large-scale in-vitro cultures of differentiat- 
ing stem cells, designing synthetic gene 
circuits that respond to physical stimuli 
such as electrical pulse and light may 
be advantageous. Weber et al. [63] de- 
signed an electrogenetic transcription unit 


that modulates the beating frequency of 
primary heart cells, whereas Ye et al. [64] 
constructed a light-responsive synthetic 
gene circuit to modulate transgenic ex- 
pression of GLP-1 in type II diabetic mice. 


3 

Precise Gene Targeting and 
Genome-Editing Technologies to Allay 
Safety Concerns of Synthetic Gene Circuits 
Encoded by Recombinant DNA 


The deployment of synthetic gene cir- 
cuits for regenerative medicine applica- 
tions invariably requires the transfection 
of recombinant DNA into stem cells or 
their differentiated progenies. For ther- 
apeutic applications, the utilization of a 
non-integrating plasmid DNA may be 
preferable, but such plasmids tend to be 
lost during cell division, which limits 
their utility for long-term applications. 
Additionally, a low probability exists of re- 
combinant plasmid DNA integrating into 
the cellular genome in a random man- 
ner, leading to the insertional mutagenesis 
of host genes that may result in adverse 
effects such as cancer. The use of viral vec- 
tors that integrate efficiently into the host 
cell genome without being lost during cell 
division faces the same problem. 
Nevertheless, during recent years 
precise site-specific gene targeting 
and genome editing technologies have 
emerged that are able to allay the safety 
concerns associated with the integration 
of recombinant DNA (encoding synthetic 
gene circuits) within the host cell genome. 
Site-specific recombinase 
(SSR) technology was the first technology 
platform to be developed that enables the 
site-specific integration of recombinant 
DNA [65-67]. The attachment of specific 
recombination sites to a segment 
of recombinant DNA allows its 
site-specific integration within the host 
cell genome through the use of SSRs such 
as Flp-FRT [65], Cre-loxP [66], and PhiC31 
[67]. The advent of SSR technology was 
quickly followed by the rational design 
and use of zinc-finger (ZF) nucleases for 
site-specific recombination [68-71]. The 
PiggyBAC transposon system enables 
the transient insertion and removal of 
transgenic elements within the cellular 
genome in a precise and site-specific 
manner, and may be utilized for the 
temporary expression of synthetic gene 
circuits for a limited duration [72-74]. 
Wilson et al. [74] mapped a total of 
575 PiggyBAC transposon integration 
sites within the human genome, and 
showed that PiggyBAC-based integration 
and the excision of recombinant DNA 
fragments within targeted chromosomal 
locations is extremely precise. In turn, 
these advantageous properties were 
exploited for the transient expression of 
transcription factors for reprogramming 
somatic cells to iPSCs [75, 76]. 

In addition to molecular tools for pre- 
cise site-specific integration and exci- 
sion, the deployment of synthetic gene 
circuits in regenerative medicine may 
also benefit from the development of 
restriction enzyme-free molecular cloning 
technology platforms, such as Gibson 
DNA assembly [77], circular polymerase 
extension cloning (CPEC) [78], and se- 
quence and ligase-independent cloning 
[79]. The construction of synthetic gene 
circuits using conventional restriction 
enzyme-based methodology has several 
inherent deficiencies that include the pres- 
ence of a restriction site “scar” between 
annealed DNA fragments, and also the 
possibility that the presence of multi- 
ple restriction sites within the destination 


vector or target gene sequence may hinder 
molecular cloning. 


4 
Non-integrating Vectors for the Expression 
of Transgenes: Episomal Plasmids and 
Sendai Virus 


To avoid clinical safety issues pertaining 
to genetic modification, transgenes may 
be expressed with non-integrating vectors 
such as episomal plasmids and the Sendai 
virus, which has a completely RNA-based 
reproductive cycle. The challenge with 
non-integrating episomal plasmids is that 
transient transfection often leads to rel- 
atively short durations of expression of 
transgenes or gene circuits. In turn, this 
short duration yields relatively low repro- 
gramming efficiencies of somatic cells 
to iPSCs when episomal plasmids are 
utilized to express the classical repro- 
gramming factors. For example, in the 
pioneering study of Yu et al. [80], it was re- 
ported that iPSC colonies were observed at 
a frequency of only 0.0003–0.0066% at 
20 days post-transfection with oriP/EBNA 
(Epstein-Barr nuclear antigen)-based epi- 
somal plasmid. Further studies utilizing 
episomal plasmids for reprogramming at- 
tempted to improve the reprogramming 
efficiency through a number of different 
strategies, including the supplementation 
of small molecules such as thiazovivin [81] 
and cotransfection with additional episo- 
mal vectors encoding SV40 large antigen 
[82] or EBNA1 [83]. The reprogrammed 
cells lost the episomal plasmids after 
10 serial passages following transfection 
[82, 84]. Nevertheless, whether there is 
absolutely no chance of genomic recombi- 
nation and insertional mutagenesis with 
episomal plasmids remains controversial. 
Certainly, rigorous screening is needed 
to confirm the complete absence of any 
trace of episomal plasmid DNA within 
the genome of newly reprogrammed iPSC 
lines. 

Because the Sendai virus has a com- 
pletely RNA-based reproductive cycle, it 
may be used to express transgenes with- 
out the risk of genetic integration. The 
Sendai virus vector is retained transiently 
within the cytosol of infected cells for only 
a few passages, and is typically lost af- 
ter extended culture durations exceeding 
10 passages. The RNA-based Sendai virus 
is inadequate for more complex synthetic 
gene circuit applications because direct 
protein interaction with recombinant DNA 
is often needed (that is, promoter and op- 
erator sequences). Nevertheless, for less 
complex applications such as the tran- 
sient expression of transgenes, the Sendai 
virus is an appropriate vector and has 
been utilized in reprogramming studies to 
generate iPSCs from various somatic cell 
types at relatively high efficiencies [85-87]. 
Unfortunately, the main drawback of the 
Sendai virus is that it is much more tech- 
nically challenging and difficult to work 
with compared with retroviral or lentiviral 
vectors. 


5 

Modulating Gene Expression through 
Direct Delivery of Proteins and RNA into 
the Cell: A Safer Alternative for Stem Cell 
Differentiation and Reprogramming 
Compared with Recombinant DNA 
Transfection 


During recent years, new technology 
platforms have been developed to mod- 
ulate gene expression through the di- 
rect delivery of proteins and RNA into 
the cell. Such delivery may provide a 
safer alternative to recombinant DNA 


transfection with its attendant risk of 
permanent genetic modification of the 
cellular genome. The direct delivery of 
recombinant proteins and peptides into 
the cell is achievable through fusion with 
specific N-terminal domains that confer 
cell membrane-penetrating ability; these 
domains are commonly referred to as 
protein transduction domains (PTDs) [88, 
89]. Besides PTDs, the therapeutic pro- 
tein transduction of mammalian cells 
with nucleic acid-free pseudotyped lentivi- 
ral nanoparticles has also been reported 
by Fussenegger and colleagues [90]. In 
a study conducted by Zhou et al. [91], 
the four classical “Yamanaka” reprogram- 
ming transcription factors (OCT4, KLF4, 
c-MYC, and SOX-2) were fused to a 
polyarginine (11R) PTD, and these fu- 
sion proteins were used successfully to 
reprogram mouse embryonic fibroblasts 
(MEFs) to iPSCs. Subsequently, human iP- 
SCs were successfully derived from fibrob- 
lasts through the use of PTD-fusion tran- 
scription factors [92]. Nevertheless, the reprogramming 
efficiency with recombinant 
PTD-fusion transcription factors was extremely 
low compared with conventional 
reprogramming techniques that utilized 
recombinant DNA and/or viral vectors. 
Moreover, the production and purification 
of sufficient quantities of the recombinant 
proteins required for reprogramming was 
technically challenging. In addition to 
iPSC derivation, PTD-fusion transcription 
factors were utilized for directing stem 
cell differentiation into specific lineages. 
For example, Lima et al. [93] utilized 
pancreatic transcription factors (Pdx1 and 
MafA) fused to PTDs to direct mouse 
ES cell differentiation into the endocrine 
pancreatic lineage. The induction of myo- 
genic and neurogenic differentiation with 
PTD-transcription factors was also re- 
ported [94, 95]. 
The direct delivery of synthetic mRNA 
was also utilized for cellular reprogram- 
ming and directing stem cell differenti- 
ation. In a seminal study performed by 
Warren et al. [96], a number of chem- 
ical modifications were carried out on 
synthetic mRNA to overcome its rela- 
tively short half-life and instability upon 
transfection into the cell. These modi- 
fications included the attachment of a 
5’-guanine cap and the substitution of uri- 
dine and cytidine with pseudouridine and 
5-methylcytidine, respectively [96]. Addi- 
tionally, B18R protein was supplemented 
within the culture milieu to suppress the 
innate antiviral response of the cell 
to the transfected synthetic mRNA. By uti- 
lizing such chemically modified synthetic 
mRNA-encoding five reprogramming fac- 
tors (OCT4, KLF4, C-MYC, Lin28, and 
SOX2), Warren et al. [96] successfully de- 
rived human iPSCs from fibroblasts at rel- 
atively high efficiencies that were compa- 
rable to — if not better than — conventional 
methodologies that utilize recombinant 
DNA and viral vectors. In a subsequent 
study, Warren et al. [97] further 
improved the mRNA transfection-based 
cellular reprogramming technique by a 
stepwise optimization of the reprogram- 
ming factor cocktail, which included the 
substitution of wild-type OCT4 with an 
OCT4-fusion protein incorporating the 
MyoD transactivation domain. This im- 
provement reduced the time duration for 
reprogramming and the number of trans- 
fections required, which in turn mitigated 
the cytotoxic effects of the lipofectamine 
reagent utilized in transfection and re- 
sulted in an enhanced cell survivability 
[97]. In addition to cellular reprogram- 
ming, synthetic mRNA transfection may 
be utilized for directing stem cell differ- 
entiation into specific lineages. However, 
the only example found to date has been 


in the original study of Warren et al. [96], 
which utilized synthetic mRNA encoding 
the transcription factor MyoD to induce 
the myogenic differentiation of iPSCs. 

A direct transfection of miRNAs was 
also used successfully to reprogram mouse 
and human somatic cells to iPSCs, though 
only one such study [98] has been con- 
ducted to date. In the seminal study per- 
formed by Miyoshi et al. [98], human iPSCs 
were derived through the transfection of 
seven mature miRNAs (mir200c, mir302a, 
mir302b, mir302c, mir302d, mir369-3p, 
and mir369-5p) into human adipose stro- 
mal cells and dermal fibroblasts. Alto- 
gether, four transfections occurred at 48-h 
intervals over a total duration of six days; 
this was far fewer than the total num- 
ber of transfections required to reprogram 
using synthetic mRNA [96], and hence 
lessened the cytotoxic effects of chemi- 
cal reagent utilized for transfection (i.e., 
lipofectamine). Nonetheless, the repro- 
gramming efficiency was relatively low at 
0.002%, and a high level of chromoso- 
mal mosaicism of karyotypically aberrant 
and normal cells was found. More re- 
cently, repression of the Mbd3 protein was 
reported to enhance the reprogramming 
efficiency of human somatic cells to iPSCs 
[99]. It is possible that microRNAs (miR- 
NAs) that specifically target Mbd3 and 
which are identified in the future may be 
utilized together with the aforementioned 
miRNA species to achieve a more efficient 
derivation of human iPSCs. Currently, no 
reported studies exist on the direct delivery 
of miRNAs to direct stem cell differentia- 
tion into specific lineages. However, this 
process is not implausible because the 
expression of miRNAs encoded by recom- 
binant DNA was reported to promote stem 
cell differentiation into various lineages 
that have important applications in tissue 
engineering and regenerative medicine, 
such as the cardiomyogenic [100, 101], 
neural [102-104], and osteogenic lineages 
[105-107]. 

Although the direct delivery of proteins 
and RNA is more compliant with clinical 
safety standards, and has significant poten- 
tial in regenerative medicine applications, 
it should be noted that these technol- 
ogy platforms are most likely unsuitable 
for more complex synthetic biology ap- 
plications involving multiple interacting 
components, which may only be fulfilled 
by synthetic gene circuits encoded by re- 
combinant DNA. 


6 

Breakthrough: Chemical-Based 
Reprogramming to Pluripotent Stem Cells 
with Small Molecules Alone 


Since the initial derivation of iPSCs 
through transgenic expression of tran- 
scription factors, numerous studies have 
reported a diverse array of small-molecule 
chemical compounds that enhance the re- 
programming efficiency with these trans- 
genically expressed proteins [108-110]. 
However, chemical-based reprogramming 
to iPSCs with small molecules alone 
was first achieved in the seminal 
2013 study of Hou et al. [111]. Through 
the high-throughput screening of numerous 
small-molecule chemical compounds 
with OCT4 promoter-driven expression 
of green fluorescent protein (GFP) in 
MEFs, together with a stepwise optimization 
of dosage and exposure times, 
these authors derived mouse iPSCs at a relatively 
high reprogramming efficiency of 
0.2% by using a combination of just 
seven small-molecule compounds (VPA, 
CHIR99021, 616452, tranylcypromine, 
FSK, DZNep, and TTNPB) [111]. 
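
To put the reprogramming efficiencies quoted in this chapter on a common footing, the short sketch below converts each percentage into the number of iPSC colonies expected from a fixed number of starting cells. The efficiencies are those stated in the text (episomal plasmids [80], miRNA transfection [98], and small molecules in mouse cells [111]); the starting population of one million cells is an arbitrary assumption chosen only for illustration, and the studies differ in species and cell type, so the comparison is indicative rather than rigorous.

```python
def expected_colonies(cells_plated, efficiency_percent):
    """Expected number of iPSC colonies for a given percent efficiency."""
    return cells_plated * efficiency_percent / 100.0


# Efficiencies (in percent) as quoted in the text for each approach.
EFFICIENCIES_PERCENT = {
    "episomal plasmid, lower bound [80]": 0.0003,
    "episomal plasmid, upper bound [80]": 0.0066,
    "miRNA transfection [98]": 0.002,
    "small molecules, mouse [111]": 0.2,
}

CELLS_PLATED = 1_000_000  # assumed starting population, for illustration

colonies = {
    label: expected_colonies(CELLS_PLATED, pct)
    for label, pct in EFFICIENCIES_PERCENT.items()
}
for label, count in colonies.items():
    print(f"{label}: about {count:.0f} colonies per {CELLS_PLATED:,} cells")
```

Expressed this way, an efficiency of 0.2% corresponds to roughly two thousand colonies per million cells, whereas the episomal range corresponds to only a handful to a few dozen.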


Nevertheless, to date, no human 
iPSC has been derived using a similar 
small-molecule cocktail alone in the 
absence of the classical protein-based 
reprogramming factors. This derivation is 
expected to materialize in the near future 
because the proof-of-principle study 
using mouse iPSC derivation was accom- 
plished. The complete chemical-based 
reprogramming of human somatic cells 
to iPSCs would certainly go a long 
way in meeting the stringent clinical 
safety standards required in regenerative 
medicine applications. 


7 

Overcoming the Limited Plasticity and 
Proliferative Potential of Adult Stem Cells 
Using a Synthetic Biology Toolkit 


Unlike iPSCs or ES cells, adult stem 
cells originating from postnatal somatic 
tissues are relatively scarce, possess 
limited proliferative potential, and have 
restricted multilineage differentiation 
potential. Although major obstacles may 
hinder their widespread application in 
regenerative medicine, the synthetic 
biology toolkit may potentially offer some 
novel solutions to these challenges. For 
example, synthetic gene circuits may be 
designed for the inducible expression of 
genes involved in regulating proliferative 
potential and cellular senescence, such as 
human telomerase [112]. To extend the 
plasticity and multilineage differentiation 
potential of adult stem cells, synthetic 
gene circuits may be designed for the 
inducible expression of protein-based 
transcription factors that promote differ- 
entiation into specific desired lineages 
[113]. Additionally, inducible gene circuits 
may also be designed for the expression 
of siRNA to suppress differentiation into 
undesired lineages. For example, bone 
marrow-derived mesenchymal 
stem cells have a certain bias to differenti- 
ate into the osteogenic and chondrogenic 
lineages through early expression of 
certain transcription factors such as Sox9 
(chondrogenic lineage) [114, 115] or Runx 
and Osterix (osteogenic lineage) [116, 
117]. If inducible synthetic gene circuits 
are designed to express appropriate 
siRNAs against these transcription factors, 
then the natural bias and predisposition 
of bone marrow-derived MSCs to 
differentiate into these lineages may 
be eliminated, possibly making these 
cells more amenable to differentiating 
into other desired lineages that may be 
driven by other complementary inducible 
gene circuits expressing appropriate 
transcription factors. 

To overcome the overwhelming safety 
concerns associated with the deployment 
of synthetic gene circuits encoded by re- 
combinant DNA, one strategy may be to 
directly deliver RNA or proteins into adult 
stem cells, as previously discussed. The 
limited half-life and transient effect of 
such delivered molecules could mimic the 
temporary inducible expression of trans- 
genes by synthetic gene circuits encoded 
by recombinant DNA. 
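
The transient character of directly delivered RNA or protein can be made concrete with a simple decay calculation. The sketch below assumes first-order (exponential) loss of the delivered molecule and uses a hypothetical half-life of 12 h; neither the kinetic model nor the value comes from the studies cited here, and real half-lives vary widely with the molecule and the cell type.

```python
import math


def fraction_remaining(hours, half_life_hours):
    """Fraction of a delivered molecule left after `hours` (first-order decay)."""
    return 0.5 ** (hours / half_life_hours)


def hours_until_below(threshold_fraction, half_life_hours):
    """Time needed for the remaining fraction to fall to `threshold_fraction`."""
    return half_life_hours * math.log2(1.0 / threshold_fraction)


HALF_LIFE_H = 12.0  # hypothetical half-life, for illustration only

for t in (0, 12, 24, 48):
    print(f"t = {t:2d} h: {fraction_remaining(t, HALF_LIFE_H):.3f} remaining")

print(f"below 10% after {hours_until_below(0.10, HALF_LIFE_H):.1f} h")
```

Under these assumptions the delivered factor behaves like a self-limiting pulse, and repeated dosing at intervals shorter than the decay time would be needed to sustain its effect.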


8 
Conclusions: Safety Issues and Future 
Outlook 


The field of regenerative and transplanta- 
tion medicine currently faces significant 
safety challenges pertaining to the use of 
genetically modified cells and viral-based 
vectors in clinical therapy. In order to 
safeguard patient health and safety, var- 
ious regulatory authorities (such as the 
U.S. Food and Drug Administration) have 


implemented many stringent and restric- 
tive regulations overseeing human clinical 
trials with genetically modified cells and 
viral vectors. Therefore, the deployment 
of complex-functioning synthetic gene circuits 
(encoded by recombinant DNA) faces 
significant regulatory and legislative hurdles 
unless pertinent safety concerns are 
adequately addressed in the future. In the 
foreseeable future, “biosafety engineering” 
or the development of safety 
“brakes” or “switches” within synthetic 
gene circuits that can restrain the prolifer- 
ation and growth of genetically modified 
cells is needed. 

In any case, even if complex-functioning 
synthetic gene circuits are not directly ap- 
plied in clinical therapy, they still have 
significant potential for basic research in 
developmental biology and stem cell sci- 
ence. In particular, because the precise 
temporal expression of specific genes is 
achievable using inducible gene circuits, 
they may be used to model gene function 
in stem cell reprogramming, differentia- 
tion, and developmental pathways. This 
would undoubtedly contribute to major 
progress in the field of stem cell and re- 
generative medicine. 
Keywords 


Synthetic circuits 
Engineered networks of biological components, such as genes or enzymes, which have 
the capability of using positive or negative feedback to modulate network output. 


Cell therapy 
The addition of cellular constituents to the human body in order to treat disease. 


Microbiome 
The collection of all microorganisms, along with their interactions, that are associated 
with the human body. 


Toggle switch 
A synthetic circuit that can stably remain in one of two gene expression states, and 
requires only a temporary stimulus pulse to switch states. 


Nanofactory 
A biomimetic cell consisting of a liposomal encapsulation of synthetic biological 
components and circuits designed to produce or process biomolecules. 


Changing the body’s cellular composition through the introduction of new cellular 
constituents has been dramatically successful as a medical intervention over the 
past century. The underlying biological signaling pathways and cellular interactions 
that control these successful cell therapies are of significant interest. In particular, it 
would be ideal if these pathways and interactions could be tuned to enhance cellular 
therapy. Fortunately, synthetic biology has emerged as a field that specializes 
in re-engineering cellular behaviors with synthetic networks. Synthetic biologists 
have developed engineered gene and protein circuits in microbial and mammalian 
cells by re-engineering novel phenotypes from across the phylogenic kingdoms. 
Together, these initial synthetic components and circuits — and the novel capabilities 
they control — have allowed biomedical scientists and engineers to create enhanced 
cellular therapies. Ultimately, these efforts to combine synthetic biology with cellular 
therapy are poised to make significant breakthroughs in the coming decades. 


1 
Introduction 


Approaches that change the cellular 
composition of the body through the 
introduction of new cellular constituents 


have proved to be dramatically success- 
ful as medical interventions over the past 
century. Ranging from the transplant of 
bone marrow stem cells to treat leukemia 
beginning in the 1950s, to the transplant 
of gut flora to fight Clostridium difficile 


infection in recent years, the manipulation 
of the body’s cellular composition through 
cell therapy has uniquely complemented 
pharmaceutical and surgical approaches 
to disease treatment. As a result, the un- 
derlying biological signaling pathways and 
cellular interactions that control success- 
ful cell therapies are of significant interest. 
In particular, it would be ideal if these 
pathways and interactions could be syn- 
thetically tuned to enhance cellular therapy 
[1]. Fortunately, the years since 2000 have 
been marked by the development of syn- 
thetic biology as a field that specializes 
in re-engineering cellular behaviors with 
synthetic networks [2]. Borrowing inspira- 
tion from electrical engineering, synthetic 
biologists have developed engineered gene 
and protein circuits in microbial and 
mammalian cells. Furthermore, they have 
harnessed and re-engineered novel phe- 
notypes from across the phylogenic king- 
doms to provide cells with new capabilities. 
Together, these initial novel circuits and 
capabilities have allowed biomedical sci- 
entists and engineers to create enhanced 
cellular therapies. For example, synthetic 
circuits and capabilities have been used to 
develop bacteria that invade cancer cells, 
as well as bacteria that disrupt cholera 
infection [3, 4]. At the same time, new 
synthetic circuits have been developed in 
mammalian cells that allow the regula- 
tion of blood glucose and the proliferation 
of T cells [5, 6]. Ultimately, these efforts 
to combine synthetic biology with cellu- 
lar therapy are poised to make significant 
breakthroughs in the coming decades. 

As a first step towards exploring the 
current and future impact of synthetic 
biology in cellular therapy, it will be im- 
portant to review developments in each 
field that are particularly amenable to in- 
tegration. In this chapter, cellular therapy 
will first be discussed, using a broad def- 
inition that encompasses multiple areas 
of research where synthetic biology could 
be most useful. Given the engineering 
undercurrent of synthetic biology, this def- 
inition will include both research thrusts 
where living cells are re-engineered, as 
well as efforts to construct cell-like en- 
capsulations of biological networks and 
functions. Synthetic networks can be con- 
structed to control functions in both living 
cells and liposomal pseudo-cells. Further- 
more, although many synthetic networks 
can be constructed in vivo, cell-free syn- 
thetic biology has an important role in 
the optimization of synthetic pathways for 
cell therapy. Here, the discussion of cellu- 
lar therapy will be followed by a review 
of synthetic biology developments that 
are currently most useful for integration 
into cellular therapies. The genome-scale 
engineering tools discussed in Chapter 
1, Synthetic biology: Implications and 
uses, are certain to be useful in ex- 
panding these initial, synthetically enabled 
therapies in the future. However, ini- 
tial efforts in synthetic biology in gen- 
eral—and specifically in cell therapy - 
have focused on the development of indi- 
vidual, fundamental control modules that 
create new phenotypes. Ideally, at some 
future point, genome-scale techniques will 
allow diverse combinations of these syn- 
thetic circuits to be deployed in enhanced 
therapies. 

Consequently, in this chapter attention 
will be focused on the synthetic compo- 
nents and circuits that have thrust syn- 
thetic biology to the forefront of new 
bioengineering disciplines, as well as how 
they can enhance cell therapy. Over a 
decade ago, the creation of two engineered 
gene networks —a toggle switch [7] and 
an oscillator [8] — began the rapid advance 
of synthetic biology as a field. During 
Fig. 1 Creating new cell therapies with synthetic circuits. A 
synthetic circuit design is engineered in a therapeutic cell. 
The cell is then transplanted into the body, interfacing with 
endogenous networks, causing a transition from a disease 
state to a healthy state. 


the years that followed, increasingly com- 
plex synthetic circuits were developed that 
enabled fundamental engineering control 
processes in cellular networks. These syn- 
thetic biological devices were inspired 
by electrical circuits as well as natu- 
ral biomolecular networks, and include 
timers, counters, clocks, logic processors, 
pattern detectors, and intercellular com- 
munication modules [9-16]. As shown 
in Fig. 1, these DNA-encoded synthetic 
circuits are typically uploaded into cells, 
enhancing them with new abilities that 
have been externally programmed by the 
circuit’s designer. In the case of cell thera- 
pies, upon infusion or implantation, the 
programmed networks can then inter- 
face with endogenous physiological net- 
works to correct aberrant conditions in 
vivo. 

New approaches using synthetic net- 
works for cell therapy are critical, because 
in many cases the creation of new medical
treatments has stalled. Many technical 
hurdles need to be overcome. For example, 
efforts to regenerate tissues are slowly 
progressing [17]. Furthermore, many treat- 
ments for cancer such as chemother- 
apy and radiation have limitations that 
include incomplete tumor targeting, inadequate
tissue penetration, and toxicity
towards normal cells [18]. Ultimately,
new therapies are needed that can be 
customized to patients, and that go be- 
yond mainstream medical approaches. 
Synthetic biology is beginning to use 
its approaches and platforms to fill this 
void and transform biomedicine into an 
engineering science. Thus, after review- 
ing established thrusts in cellular therapy 
that are amenable to these approaches, 
key synthetic biological discoveries and 
developments will be described, followed
by a detailed description of new technologies
that are revolutionizing cell therapy
using synthetic approaches. 


2 
Cell Therapy Successes 


In order to consider cell therapy from 
the perspective of its capacity for 
re-engineering using synthetic biology, it 
is important to employ a broad definition 
of cellular therapy. In this discussion, 
multiple possible medical interventions 
will be considered that can be regarded 
as different forms of cell therapy. In 
particular, it will be critical to expand the 
definition of the cell itself to include the 
pseudo-cell encapsulations mentioned 
above. To that end, five major thrusts 
in cell therapy can be considered as 
examples of cell therapy modalities
that can be optimized by synthetic 
biology. Each of these is shown in 
Fig. 2. 


2.1
Hematopoietic Stem Cell Transplantation 


Beginning in the mid-1950s with the 
first bone marrow transplants between 
identical twins by E. Donnall Thomas, and
continuing through today, hematopoietic 
stem cell transplantation (HSCT) remains 
one of the most familiar and clearest 
clinical successes in cell therapy [19]. Ad- 
ditionally, the nuances and complications 
of the therapy are potentially amenable to 
engineering using synthetic biology. Ini- 
tially pioneered to treat leukemia, HSCT 
is now used to treat a broad range of 
disorders of the blood, bone marrow and 
immune system [20-22]. The sequence 
of clinical steps, which is surely famil- 
iar to biomedical scientists, consists of 
cell harvest, myeloablation of the resident 
hematopoietic stem cell (HSC) population 
using chemotherapy and/or radiation, fol- 
lowed by transplant of an autologous or 
allogeneic HSC graft by infusion [23].
Care must be taken to address multiple
complicating factors, and patient-specific 
tuning of the therapy by the attending 
clinician is critical, for example, through 
the administration of immunosuppres- 
sants or vaccines following transplantation 
[20, 23]. 

Especially in the case of allogeneic 
grafts, as the graft enables a return of 
hematopoiesis, two particularly common 
phenomena must be managed. First, fol- 
lowing allogeneic transplantation, some 
degree of graft-versus-host disease, medi- 
ated by T cells, is likely to be present. 
Although an allogeneic transplant must 
be performed using a donor that is hu- 
man leukocyte antigen (HLA)-matched to 
the recipient, it remains likely that new 
T cells will still have some reaction to 
the host’s cells [24]. Second, in the case of the
graft-versus-tumor effect, this action by 
T cells can be beneficial as the T cells 
recognize cancerous cells and help to elim- 
inate them [25]. Tuning this process of cell 
recognition through the introduction of 
synthetic genes and receptors is a vigorous 
field of research, and synthetic biological 
approaches could be especially useful in 
guiding the behavior of T cells in both 
graft-versus-host and graft-versus-tumor 
effects. 


2.2 
Engineered Immunotherapy 


In fact, engineering T cells to fight can- 
cer has become a widely pursued re- 
search goal over recent decades [26, 27], 
and several such studies have been con- 
ducted [28, 29]. The T cells themselves 
can be harvested from patients and en- 
gineered with chimeric antigen recep- 
tors (CARs) to target antigens present 
on the tumor cells. Initial intravenous
grafts can persist for months, often
expanding to cell counts over 1000-fold
those of the initial engraftment. Much
like HSCT, the off-target effects of T cells
must be managed, and synthetic biology
has significant potential in controlling
T-cell behavior. Two recent examples
have shown just how powerful engineered
immunotherapy could be in
targeting acute lymphoblastic leukemia
(ALL).

Fig. 2. Cell therapy can be defined broadly.
Therapeutic cells can be delivered to the body
in multiple ways. Hematopoietic stem cell
transplantations, such as classical bone marrow
transplants, can treat cancers and other
disorders of the blood and immune system.
Bacterial cells can be added to the gut microbiome
to restore normal function and
fight infection. New tissues, such as skin,
can be created by mixing extracellular matrix
proteins (e.g., collagen) with epithelial progenitor
cells, and implanting into the body. Harvested
T cells can be engineered with chimeric
antigen receptors to treat acute lymphoblastic
leukemia. Even delivered drugs can use
biomimicry to become cell-like; for example,
liposomes encapsulating therapeutic molecules
can form biomimetic cell membranes to more
easily penetrate the tissues.


In one study, five adults with 
relapsed B-cell acute lymphoblastic 
leukemia (B-ALL) were treated using 
autologous T cells engineered to 
express a CD19-specific CD28/CD3ζ
second-generation dual-signaling CAR 
termed 19-28z. These patients typically 
have a poor prognosis, and require a 
second, stable remission of their cancer, 
followed by a successful allogeneic HSCT. 
This new approach showed rapid tumor 
eradication, although high levels of 
cytokine-mediated toxicity did require 
management with steroids [28]. 

In another study, two children with 
relapsed and refractory pre-B-cell ALL re- 
ceived infusions of T cells transduced with 
anti-CD19 antibody and a T-cell signaling 
molecule (CTL019 CAR T cells). In both
patients, the engineered T-cell counts ex- 
panded to more than 1000-fold the initial 
engraftment level. The engineered cells 
were identified in bone marrow and in 
the cerebrospinal fluid (CSF), and per- 
sisted for at least six months. Cytokine 
release also required management in these 
cases [29]. Furthermore, some tumor cells 
emerged that no longer expressed CD19. 
Together, these studies demonstrate the 
potential avenues for synthetic biology to 
enhance therapy. For example, synthetic 
circuits could be designed that selectively 
turn on the targeting of specific antigens 
such as CD19. Later in this chapter, in 
the discussion of new synthetic biology 
advances in cell therapy, recent studies 
in tuning T-cell proliferation and poten- 
tially pausing cytokine release will be 
reviewed. 


2.3 
Tissue Engineering 


In addition to HSCT, another important 
field where stem cells can be deployed
to augment the body’s function (tissue
engineering) is also an excellent potential
area for synthetic biology to be leveraged.
Cellular engineering has been broadly ap- 
plied to tissue engineering problems, and 
tissue engineering studies are a major 
thrust in the field of biomedical engi- 
neering. In most tissue engineering ap- 
proaches, living cells are combined with 
an extracellular matrix (ECM), and then 
either studied in the laboratory or ap- 
plied to a site of injury within a living 
model organism [30]. The ultimate goal 
of these tissue engineering studies is 
to generate functional tissues that can 
completely replace damaged tissue, while 
replicating the native tissue’s endoge- 
nous chemical, electrical, and mechanical 
properties. 

As a result, many studies have focused 
on preconditioning cells or tissue con- 
structs prior to analysis — or implantation 
followed by analysis — to coax internal sig- 
naling networks to behave as they would 
in the native tissues. These precondition- 
ing regimens can be biochemical cocktails 
designed to drive pluripotent embryonic 
stem cells (ESCs) [31] or, alternatively, 
multipotent adult stem cells [32], down 
specific lineages. They can also focus on 
coaxing immature cells toward more com- 
plex morphologies. Furthermore, these 
cellular preconditioning approaches can 
also be based on electrical [33] or mechan- 
ical stimuli [34]. Ideally, upon exposure to 
these preconditioning routines, cells will 
express phenotypes mirroring those in the 
native tissue, and will remodel the ECM 
to reproduce the complicated morphology 
of the tissue being replaced. Along these 
lines, promising studies have been con- 
ducted to generate multiple types of tissues 
in the laboratory, and just a few of these in- 
clude myocardium [35], cartilage [36], skin 
[37-40], and bone [41, 42]. Of these, the
most successful and widely used clinical
therapy is engineered skin [38, 43, 44] (as 
shown in Fig. 2). 

Moreover, several overlapping engineering
thrusts within a related field, biomaterials,
have focused on how appropriate
formulations of modified ECM or
ECM-surrogates can be designed. For 
example, a modified ECM can be designed 
to slowly release growth factors that shape 
stem cell differentiation [45, 46]. Alterna- 
tively, an ECM can be designed to replicate 
the strength and material geometry of the 
native ECM [47]. These approaches have 
become so intricate, yet robust, that engi- 
neered ECMs with complex morphologies 
have been developed that are compressible 
enough to be injected through a needle, yet 
retain their shape upon potential ejection 
in vivo [48]. However, these approaches 
focus on driving cellular behavior by engi- 
neering their environment. Directly engi- 
neering the internal biochemical networks 
of cells will be critical to speed the devel- 
opment of robust new tissues by directly 
programming cell behavior through syn- 
thetic biology. As a result, as the delivered 
cellular constituents are the critical com- 
ponent of the therapy, the definition of 
cell therapy should be expanded to include 
tissue engineering, particularly given its 
amenability to optimization through syn- 
thetic biology. 


2.4 
Enhancing the Microbiome 


In addition to using human cells as ther- 
apeutic interventions, microbial cells can 
also be introduced to the body to treat dis- 
ease. While any discussion of cell therapy 
naturally brings to mind methods such as 
bone marrow transplantation and tissue 
engineering as discussed previously, di- 
rectly manipulating and augmenting the
cells in a patient’s microbiome has been 
a useful approach in regulating human 
physiology and correcting aberrant con- 
ditions. The human microbiome is com- 
posed of all the microorganisms associated 
with the body, numbers over 1000 species, 
and outnumbers the host cells by a factor 
of between 10 and 100 [49]. The individ- 
ual species within the microbiome are 
typically well-tolerated, commensal mi- 
croorganisms and, as a result, they are 
potentially excellent vectors for deploying 
synthetic gene circuits to treat disease. 
Moreover, because most synthetic biology 
tools and approaches have been devel- 
oped in microorganisms, cell therapies 
based on re-engineering the microbiome 
with synthetic biology may be especially 
feasible. 

The human gut is perhaps the most 
accessible of the body’s microbial ecosys- 
tems for deploying cell therapy (Fig. 2). 
Although the health benefits of augmenting
the gut microbiome have been recognized
for millennia (e.g., by consuming
active food cultures), several key discov- 
eries and therapies have been developed 
over the past century. While most studies 
of the gut microbial activity have focused 
on pathogenic species, such as the ground- 
breaking discovery of Helicobacter pylori 
and its pathogenic role [50], many studies
have shown that disease can be treated
by augmenting the gut microbiome with 
beneficial microbes. For example, in 1917 
Alfred Nissle discovered a strain of Es- 
cherichia coli that could be used to treat 
gastrointestinal disorders, a strain which 
he isolated from stool samples from sol- 
diers in the trenches of World War I [51]. 
This strain, E. coli Nissle 1917, has become 
one of the most extensively studied en- 
teric organisms, and its efficacy in treating 
diseases ranging from leaky gut to gastric 
colitis is well documented [52, 53]. In fact,
it continues to be sold today under the 
pharmaceutical name Mutaflor®. 

More recently, several groundbreaking 
studies have linked the gut microbiome 
to diabetes, obesity, and heart disease. 
For example, Everard et al. recently found 
that Akkermansia muciniphila, numerous 
within the gut, were present at lower than 
normal levels in mice with type 2 dia- 
betes and also in obese mice [54]. Tang 
et al. recently showed that intestinal mi- 
crobial metabolism was directly linked to 
levels of a proatherosclerotic metabolite, 
trimethylamine-N-oxide (TMAO) in the 
body [55]. Moreover, microbiome trans- 
plantation has also been shown to be 
effective in treating Clostridium difficile 
infection [56, 57]. In a recent study, 15 
out of 16 patients suffering from C. dif- 
ficile and receiving fecal transplants were 
cured, in comparison to three out of 13 
patients receiving vancomycin alone [58]. 
This confirmed a critical ability of the 
healthy microbiome in protecting against 
infection, and also showed the potential of 
basing therapies on gut microbiome trans- 
plants. However, in many cases — such as 
the C. difficile treatment described — it is 
unclear which microbiome constituents 
are most critical to treatment. Surely, so- 
cial interactions play an important role 
in microbiome communities [59, 60], and 
could potentially be harnessed in synthetic 
biology approaches to optimizing micro- 
biome cell therapies. These interactions 
are further complicated by genetic infor- 
mation exchange among constituents, for 
example, through bacteriophage infection, 
which was recently shown by Collins and 
colleagues to be enhanced following an- 
tibiotic treatment [61]. In the future, these 
virally mediated genetic exchanges could 
be prime entry points for deploying syn- 
thetic biology approaches in cell therapy 
using the microbiome.


2.5
Liposomal Encapsulations 


The definition of cell therapy can even be 
broadened to include nanoscale carriers of 
biologics. These nanocarriers include poly- 
mer conjugates, polymeric nanoparticles, 
and lipid-based carriers such as liposomes 
and micelles. Here, the utilization of 
lipid-based carriers, specifically liposomes, 
will be discussed because they are widely 
used in the clinic [62]. By creating these ar- 
tificial cell membranes, biological agents 
can better pass through the body by camou- 
flaging themselves as native cells [63, 64]. 
Just as biomolecules can be integrated into 
cell membranes, synthetic molecules can 
also be embedded within artificial mem- 
branes, modifying the function and tar- 
geting of these liposomal compartments 
[65]. As these drug-delivery vehicles have 
grown in complexity, various biomedical 
scientists and engineers have suggested 
the paradigm of the biological nanofactory 
[66, 67], whereby the system encapsulated 
by a liposome functions as a molecular 
assembly line, constructing critical ther- 
apeutic molecules in situ. Clearly, these 
systems would be amenable to synthetic 
pathways that control the production of 
these molecules. Ultimately, encapsulat- 
ing synthetic biological networks within 
liposomes will be a critical new direction 
in synthetic biology. Thus, expanding the 
definition of cell therapy to include these 
artificial cellular systems will be impor- 
tant when considering the role of synthetic 
biology in cell therapy. 


3 
Synthetic Circuits 


Synthetic circuits represent one of the core 
technologies available in synthetic biology. 
They are critical for reprogramming
cellular phenotypes to enhance therapies
and, as noted previously, their development
has spurred the growth of synthetic
biology as a field. In the discussion that follows,
the design approaches in synthetic
circuit construction will be outlined and 
a review provided of robust gene and 
protein circuits that have been success- 
fully constructed. Synthetic biology and its 
revolutionizing approaches to cellular en- 
gineering have already shown the potential 
to impact cell therapies by altering their 
internal signaling networks. Whenever at- 
tempting to engineer cellular signaling 
cascades, care must be taken to ensure 
a robust design and a predictable cell be- 
havior, along with a repeatable and precise 
expression of critical genetic outputs. Bi- 
ological systems naturally incorporate all 
of these requirements in their internal 
programming. 

As noted above, the development of 
two synthetic circuits, the toggle switch
and an oscillator dubbed “the repressilator”,
inspired the field of synthetic
biology as far back as 2000 [7, 8]. Instead
of open-loop control of gene expression, 
these circuits employed feedback to 
provide an additional level of complexity 
in the precision control of phenotype. 
During the years following these develop- 
ments, however, new circuits have been 
developed that leverage components from 
a range of different organisms. Synthetic 
circuit design has also expanded beyond 
these initial circuits that were based on 
DNA-protein interactions, such that 
circuits based on protein-protein inter- 
actions are now also included [68, 69]. In 
both cases, new computational algorithms 
and software programs have been created 
to speed the engineering of synthetic 
biological circuits [70-73]. These research 
thrusts have built a foundation for 
synthetic biology as a complete discipline. 


3.1 
Engineering Synthetic Gene Circuits 


In order to develop the first synthetic 
circuits, several molecular biological 
components and control structures had to 
be available. During the last decades of the 
twentieth century, investigators began to 
develop control elements for each step in 
the cascade of events between gene tran- 
scription and protein translation. Each of 
these processes is regulated by biophysical 
interactions, such as the docking of 
transcription factors and RNA polymerase 
to operator sites on promoters, or the 
binding of ribosomes to messenger RNA 
(mRNA) transcripts. Synthetic biology 
components that control and manipulate 
all of these interactions have been created, 
and these components can be combined 
to form modules for the design and con- 
struction of synthetic circuits. Biological 
circuits consist of modules of interacting 
genes and proteins, and synthetic biology 
seeks to rewire these modules to create 
new circuits and new phenotypes. Several 
other components have been created that
leverage interactions between RNA
molecules themselves [74, 75], between
RNA molecules and small molecules
[5, 76-79], and between recombinases
and DNA templates [11, 80].

For example, new engineered promoters 
[81] were critical in the design of networks, 
such as the aforementioned toggle switch 
which was developed in the laboratory of 
James Collins [7]. As shown in Fig. 3a, the 
toggle consists of two mutually repressing
genetic operons such that, when the
strength of each repression event is bal- 
anced, a bistable switch is formed. Over 
multiple generations of cell division the 
toggle will remain in either an ON or 
OFF state, regardless of the presence of 
a corresponding inducer. In this network
topology, the ability of each operon to repress
the other must be balanced, and
these repressing interactions are governed
by several factors, such as the strength
of each operon’s promoter as well as the
strength of the ribosome binding sites
(RBSs) associated with each repressor’s
mRNA transcript. By carefully altering the
nucleic acid bases in each promoter and
RBS, the switching properties of the toggle
switch can be manipulated.

Fig. 3. A broad range of synthetic components
and control structures exist. (a) Synthetic
components can be combined to create
control structures such as memory. Two mutually
repressing operons can be used to create
a bistable toggle switch; (b) Alternately, logic
behavior such as a digital AND-gate can be
created. Two engineered, synthetically activated
operons alternately drive the expression of a
faulty viral polymerase transcript or an RNA
element that corrects the resulting translation
errors. When expressed together, the resulting
translated viral polymerase can activate
the expression of an output gene; (c) Filter
characteristics can be modified by tuning the
degradation of protein or RNA regulators; (d)
A synthetic scaffold can be rewired to control
the mitogen activated protein kinase (MAPK)
cascade in eukaryotes. By engineering docking
sites for positive and negative effector proteins
using leucine zippers, synthetic circuits can be
created that control the MAPK cascade.
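
The balance requirement for the toggle switch can be illustrated numerically. The following is a minimal sketch, assuming a two-variable dimensionless model in which each repressor is synthesized at a maximal rate (alpha1 or alpha2) that is reduced by Hill-type repression from the other repressor, and is lost by first-order decay. The function name simulate_toggle, the parameter values, and the simple Euler integration are illustrative assumptions and are not taken from Ref. [7].

```python
def simulate_toggle(u0, v0, alpha1=10.0, alpha2=10.0,
                    beta=2.0, gamma=2.0, dt=0.01, steps=5000):
    """Euler integration of a hypothetical dimensionless toggle model:
         du/dt = alpha1 / (1 + v**beta)  - u
         dv/dt = alpha2 / (1 + u**gamma) - v
    u and v are the two repressor levels; alpha1 and alpha2 lump together
    promoter and RBS strength (an assumption made for illustration)."""
    u, v = u0, v0
    for _ in range(steps):
        du = alpha1 / (1.0 + v ** beta) - u
        dv = alpha2 / (1.0 + u ** gamma) - v
        u, v = u + dt * du, v + dt * dv
    return u, v


if __name__ == "__main__":
    # Balanced repression: the final state depends on the starting state.
    print("balanced, start u-high:", simulate_toggle(5.0, 1.0))
    print("balanced, start v-high:", simulate_toggle(1.0, 5.0))
    # Unbalanced repression (weak second operon): only one state survives.
    print("unbalanced, start v-high:", simulate_toggle(1.0, 5.0, alpha2=1.0))
```

In this sketch, balanced synthesis rates leave the circuit in whichever state it started in, which is the memory behavior described for the toggle. Weakening one operon (here by lowering alpha2) removes one of the two stable states, which is why promoter and RBS strengths must be tuned together.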

Other synthetic networks illustrated the 
potential for digital logic gate behavior 
in synthetic circuits [14, 80, 82-84]. In 
one of the first examples, Anderson et al. 
created a synthetic AND-gate in bacteria 
(see Fig. 3b). This AND-gate switching behavior,
mimicking the gates enabled by
integrated circuits in electrical engineering,
employed an engineered viral RNA
polymerase with a slight defect (an amber,
UAG, stop codon) in the middle
of its transcript [82]. As a result, when
one input (e.g., salicylate) was provided
only a partial polymerase was expressed.
However, with the addition of a second
input (e.g., arabinose), the transcription
of a small RNA was initiated and this
small RNA functioned to suppress
the defect. As a result, the fully translated
viral polymerase was able to activate
its viral promoter. In another study,
the Voigt group engineered communities
of bacteria to produce NOR-gate behavior
[83]. This is especially important as
NOR-gates are functionally complete, and
therefore can be combined to form any other
type of logic gate. As a result, these studies
suggested that with sufficient discrete
bacterial colonies, significantly more complex
digital computation could
be possible. Alternately, other applications
could require analog signals, and to this 
end studies conducted in the laboratory of 
Timothy Lu have recently resulted in the
creation of sophisticated analog circuits in 
single bacterial cells [85]. 
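
The logic described in this paragraph can be written out as truth tables. The sketch below is purely Boolean and makes no attempt to model expression levels: and_gate abstracts the two-input polymerase circuit (output only when both the defective polymerase transcript and the correcting small RNA are present), and the remaining functions show how NOT, OR, and AND can each be assembled from NOR-gates alone. All function names are hypothetical and chosen for illustration.

```python
from itertools import product


def and_gate(polymerase_transcript_on, suppressor_rna_on):
    """Boolean abstraction of the two-input circuit: a full-length
    polymerase (and hence output) requires both inputs."""
    return polymerase_transcript_on and suppressor_rna_on


def nor(a, b):
    """NOR: output is ON only when neither input is ON."""
    return not (a or b)


def not_from_nor(a):
    return nor(a, a)


def or_from_nor(a, b):
    return not_from_nor(nor(a, b))


def and_from_nor(a, b):
    return nor(not_from_nor(a), not_from_nor(b))


def truth_table(gate):
    """List (input_a, input_b, output) for every input combination."""
    return [(a, b, gate(a, b)) for a, b in product([False, True], repeat=2)]


if __name__ == "__main__":
    for name, gate in [("AND", and_gate), ("NOR", nor),
                       ("AND built from NOR", and_from_nor),
                       ("OR built from NOR", or_from_nor)]:
        print(name, truth_table(gate))
```

The AND-gate assembled from three NOR-gates gives the same truth table as the direct AND-gate, which is the sense in which layered NOR-gate colonies could in principle implement any other gate.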

Other synthetic circuits are also pos- 
sible, such as the genetic timers created 
by Collins and colleagues. These timers, 
when deployed in yeast, allow flocculation 
to be precisely timed [10] and therefore 
can be used to control the fermentation 
process. In another interesting study, 
Stricker et al. showed that a minimal set 
of components could create an oscillator 
which was much more robust than 
the repressilator [86]. In fact, several 
oscillators have been developed since the 
original repressilator in both bacterial 
and mammalian systems [8, 12, 86-89]. 
These oscillator enhancements mirror 
efforts to improve the efficiency of other 
basic control modules that have been 
built into synthetic gene circuits. Other 
studies also have demonstrated that signal 
filtering can be created in autoregulated, 
negative-feedback circuits by tuning the 
degradation level of repressor elements 
(see Fig. 3c) [90-93]. More relevant to 
cell therapy, each synthetic circuit could 
allow unique, precise types of signaling 
behaviors to drive the therapeutic function 
of cellular therapeutics. 
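
To make the oscillator concept concrete, the following minimal sketch simulates a three-gene ring in which each gene's protein represses the next gene, the topology of the repressilator. It assumes a commonly used dimensionless form with one mRNA and one protein variable per gene. The parameter values (alpha, alpha0, beta, n), the initial conditions, and the Euler time step are illustrative assumptions, not values reported in Refs [8] or [86], and the function names are hypothetical.

```python
def simulate_repressilator(alpha=216.0, alpha0=0.216, beta=5.0, n=2.0,
                           dt=0.01, steps=20000):
    """Euler integration of a three-gene repressor ring (dimensionless):
         dm_i/dt = -m_i + alpha / (1 + p_j**n) + alpha0
         dp_i/dt = -beta * (p_i - m_i)
    where gene i is repressed by the protein p_j of the previous gene.
    Returns the time course of the first protein."""
    m = [0.0, 0.0, 0.0]
    p = [5.0, 0.0, 15.0]  # unequal starting levels break the symmetry
    trace = []
    for _ in range(steps):
        dm = [-m[i] + alpha / (1.0 + p[(i - 1) % 3] ** n) + alpha0
              for i in range(3)]
        dp = [-beta * (p[i] - m[i]) for i in range(3)]
        m = [m[i] + dt * dm[i] for i in range(3)]
        p = [p[i] + dt * dp[i] for i in range(3)]
        trace.append(p[0])
    return trace


def count_peaks(xs):
    """Count local maxima in a sampled time course."""
    return sum(1 for i in range(1, len(xs) - 1)
               if xs[i - 1] < xs[i] >= xs[i + 1])


if __name__ == "__main__":
    late = simulate_repressilator()[10000:]  # discard the initial transient
    print("peaks:", count_peaks(late))
    print("min, max:", min(late), max(late))
```

Under these assumed parameters the protein level is expected to keep cycling rather than settle to a constant value. Lowering alpha substantially should damp the oscillation, which gives a sense of why component strengths matter for robustness.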


3.2 
Engineering Synthetic Protein Circuits 


Although engineered circuits that rely on 
DNA-protein interactions have formed 
a major thrust in synthetic biology, a 
second key thrust that could be even 
more critical in leveraging synthetic bi- 
ology in cell therapy has been the devel- 
opment of protein-protein signaling cir- 
cuits. Just as genetic components allowed 
the development of diverse, tunable syn- 
thetic gene circuits, engineered protein 
signaling components have also allowed 
for new synthetic circuit designs. These
engineered protein interactions have been 
important in two key ways. First, many of 
these new components interface with the 
previously described genetic circuits, pro- 
viding enhanced features. Second, other 
aspects of these components allow for 
robust, speedy protein-protein interac- 
tion circuits. In fact, one key benefit of 
these protein-based circuits is the oppor- 
tunity to take advantage of the speed at 
which proteins interact. In comparison to 
gene expression, enzyme-mediated single 
phosphorylation signals are evident in cel- 
lular phenotype on the order of seconds, 
whereas gene expression events frequently 
require minutes or longer to emerge in a 
cell’s phenotype. 

Several exciting examples of these 
protein-protein components have been 
developed during the past few years. One 
example has been the significant impact of 
synthetic biology on the field of optogenet- 
ics where, in preliminary studies, a light 
sensor was created by fusing a cyanobacte- 
rial photoreceptor to an E. coli intracellular 
histidine kinase domain in order to control 
gene expression [69]. Subsequently, these 
studies have been expanded to allow acti- 
vation by multiple different wavelengths, 
along with multiple applications in a sig- 
nificant expansion of the components 
available to synthetic biologists [6, 94-96].
In a particularly interesting advance, June 
Medford’s laboratory developed a plant 
signaling receptor that allowed plants to 
detect trinitrotoluene (TNT). By using this 
synthetic component, Medford’s group en- 
gineered sensitive Arabidopsis plants that 
turned white in the presence of TNT [97]. 
Beyond components, protein-protein in- 
teractions have also formed the basis of 
synthetic circuits. Some of the most in- 
teresting findings along these lines have 
derived from the laboratory of Wendell 
Lim and colleagues [68, 98-100], who
showed that positive and negative feedback
loops could be engineered by anchoring 
synthetic protein signaling components 
directly to an engineered protein scaffold 
using leucine zippers [68] (see Fig. 3d). 


3.3
Deploying Circuits in Mammalian Cells 


The initial circuit modules developed in 
microbes such as bacteria and yeast have 
been expanded in higher-order eukary- 
otes, and several synthetic circuits are now 
available that can robustly control gene ex- 
pression in mammalian cells. This precise 
control of specific genes will be critical 
for effective cell therapies. As an example 
of these mammalian circuits, Deans et al. 
developed a tunable, modular mammalian 
genetic switch [101]. This synthetic gene 
network coupled repressors with an RNA 
interference (RNAi) design involving short 
hairpin RNA (shRNA). Gene expression 
was turned on by the addition of an in- 
ducer, which controlled the transcription 
of a repressor and simultaneously turned 
off generation of the RNAi component. 
Thus, this synthetic module would allow 
the transcript to be retained and translated 
(Fig. 4a). This construct offered >99% re- 
pression, along with an ability to tune gene 
expression. This modular component can 
allow for the regulation of any gene, and 
was validated in both mouse and human 
cells. 

Another synthetic control module, the
toggle switch, has also been developed
in mammalian cells (Fig. 4b). In multi- 
cellular systems such as the tissues of 
humans, cell identity is regulated by epi- 
genetic networks that determine which 
genes become part of each cell’s tran- 
scriptome. By combining two repressors, 
which control each other’s expression, 
Fussenegger and colleagues developed a
mammalian epigenetic circuit that exhibited
genetic toggle behavior and could
be switched by using two different drugs.
Fussenegger’s group used the toggle to
regulate the expression profiles of a human
glycoprotein in engineered Chinese
hamster ovary cells, and these cells also
functioned after microencapsulation and
implantation into mice. As with previous
bacterial toggles, the switching dynamics
and expression could be predicted with
mathematical models [102].

Fig. 4 Synthetic circuits for the control of
mammalian cells. (a) Tight control of specific
genes will be critical for effective cell therapies.
Here, a tunable, modular mammalian genetic
switch was created that couples repressor
proteins with an RNAi design involving
shRNA. The switch is controlled by an inducer,
which activates repressor expression, while
simultaneously turning off RNAi components,
allowing the output gene to be translated; (b)
Genetic toggle switches can also be created
in mammalian cells. By combining engineered
streptogramin and macrolide-inducible promoters
to drive the expression of their respective
engineered repressors, the mammalian output
gene secreted alkaline phosphatase (SEAP)
can be stably toggled by pulsing the system
with erythromycin (E) or pristinamycin I (PI);
(c) Synthetic circuits can be constructed in
yeast with predictable behavior upon transfer
to mammalian cells. Here, a linearizer gene
circuit was created using the TetR repressor
and an arbitrary target gene, both controlled
by the same promoter. In the absence of inducer,
the TetR protein blocks transcription
from both promoters. When an inducer is
added to cause de-repression, protein levels
increase until TetR synthesis exceeds inducer
influx, at which point TetR blocks both promoters
once again. By using analytical and
computational modeling, investigators were
able to accurately predict the behavior of the
circuit in mammalian cells after recording its
behavior in yeast.

While these synthetic networks demon- 
strate the usefulness of synthetic control 
in mammalian cells, the engineering of 
synthetic networks in mammalian cells is 
significantly more challenging than devel- 
oping circuits in bacteria or yeast. Bac- 
teria and yeast are much more tractable 
as model organisms for the design and 
tuning of synthetic circuits, as they are 
more easily genetically engineered. Ide- 
ally, mammalian circuits could be first 
developed in simple eukaryotes such as 
yeast in a manner that would allow their 
behavior in mammalian cells to be eas- 
ily predicted upon transfer. Thus, for 
cell therapy applications, once a clinician 
has determined the mammalian cellu- 
lar behavior needed to treat a disease, 
the corresponding synthetic circuits could 
be rapidly developed in yeast. Synthetic 
biologists could generate a diverse li- 
brary of functional circuits — for example, 
through approaches such as site-directed 
mutagenesis — each with a predictable 
function in human cells. The resultant 
library would then provide the clini- 
cian with a spectrum of synthetically 
engineered cell therapies that could be 
prescribed. 

As a step towards predictably transfer- 
ring synthetic circuits from yeast to mam- 
malian cells, the group of Gabor Balazsi 
constructed a mammalian version of a
negative feedback-based “linearizer” gene
circuit (Fig. 4c). Although the working 
version from yeast was initially nonfunc- 
tional in mammalian cells, computational 
modeling suggested that function could 
be recovered by improving nuanced gene 
expression and protein localization. As a 
result, after rationally developing and com- 
bining new synthetic parts — as suggested 
by the model — the circuit regained func- 
tion in human cells. The investigators were 
then able to tune and target the gene ex- 
pression linearly and precisely, just as they 
had been able to do in yeast. This ap- 
proach should be relevant in transferring 
many gene circuits of interest from yeast 
to mammalian cells, and will be critical for 
the rapid development of therapies in the 
future [103]. 
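
The linearizing effect of negative feedback can be sketched with a deliberately simplified steady-state model. This is not the model from [103]; it assumes that free TetR (x) represses its own promoter and the target promoter through the same Hill function, that the inducer enters at a constant influx and binds TetR irreversibly, and that all species are diluted at a rate of 1. The parameter values (a, b, theta, n) are illustrative assumptions.

```python
def hill_repression(x, theta=0.1, n=2.0):
    """Fraction of maximal promoter activity at free-repressor level x."""
    return 1.0 / (1.0 + (x / theta) ** n)


def linearizer_output(influx, a=50.0, b=1000.0, theta=0.1, n=2.0):
    """Steady-state synthesis rate of the target gene for a given inducer influx.

    Assumed model (dilution rate normalized to 1):
        dx/dt = a * F(x) - b * x * y - x      (free TetR)
        dy/dt = influx  - b * x * y - y       (free inducer)
    The steady state is found by bisection on x.
    """
    def balance(x):
        y = influx / (1.0 + b * x)  # free inducer from dy/dt = 0
        return a * hill_repression(x, theta, n) - x - b * x * y

    lo, hi = 0.0, a  # balance(0) = a > 0 and balance(a) < 0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if balance(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    x_ss = 0.5 * (lo + hi)
    return a * hill_repression(x_ss, theta, n)


if __name__ == "__main__":
    for influx in (0.0, 10.0, 20.0, 30.0, 40.0):
        print(f"influx={influx:5.1f}  output={linearizer_output(influx):6.2f}")
```

Under these assumptions the synthesis rate tracks the inducer influx almost one-to-one until the promoter saturates near a, which is the sense in which the feedback loop makes the dose response linear.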


4 
Cell Therapies Enabled by Synthetic Biology 


The above-described collection of engi- 
neered gene networks forms the founda- 
tion of a large synthetic biology toolkit 
that is now available for the creation of 
new cell therapies. These synthetic com- 
ponents and circuits— which have been 
developed in both microbial and mam- 
malian cells—form the basis of several 
approaches that have shown promise in 
regulating human physiology. A selection 
of these new approaches towards synthet- 
ically enabled cell therapy is described in 
the following sections. 


4.1 
Cell Therapy with Engineered Bacteria 


As noted above, the cells which regu- 
late human physiology include both the 
body’s human cells as well as the col- 
lection of all microorganisms associated 
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Fig. 5 The gut microbiome can be syntheti- 
cally reprogrammed by adding new members 
to the gut ecosystem. Probiotic, commensal 

E. coli were engineered to secrete the molec- 
ular signal cholera autoinducer (CAI-1), which 


with the body (i.e., the microbiome). As 
these commensal organisms are well tol- 
erated, they provide an ideal platform for 
launching synthetic networks in the body, 
and with this approach in mind Duan and 
March recently engineered a commensal 
strain of E. coli to prevent cholera in- 
fection by creating a synthetic interaction 
between gut microbes [4]. Cholera infec- 
tion is marked by the secretion by Vibrio 
cholerae of virulence factors which, at a low 
population density, include cholera toxin 
(CT). In order to assess its own density, V. 
cholerae employs quorum sensing, a pro- 
cess in which the cell secretes and detects 
autoinducer signaling molecules. Specif- 
ically, V. cholerae detects cholera autoin- 
ducer 1 (CAI-1) and autoinducer 2 (AI-2). 
When the levels of both autoinducers are
high, the expression of virulence factors
ceases. Duan and March leveraged this
mechanism by engineering E. coli which 
were already producing AI-2 to also secrete 
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leads to inhibition of V. cholerae virulence. In 
mice infected with cholera, survival rates were 
dramatically increased for mice that had con- 
sumed the engineered probiotic several hours 
prior to ingesting V. cholerae. 


CAI-1 (Fig. 5). Thus, it was found that 
when infant mice ingested the engineered 
E. coli 8 h prior to V. cholerae ingestion,
their survival rate was increased dramati- 
cally. Furthermore, the intestinal binding 
of CT was reduced by 80%. 

In addition to using engineered mi- 
crobes to deter infection by other micro- 
bial species, they could also be added to 
a patient’s microbiome to secrete ther- 
apeutic molecules directly to the body. 
Along these lines, commensal bacte- 
ria have been engineered to secrete 
molecules for the treatment of several dis- 
eases, including insulinotropic proteins 
for diabetes [104], a human immunode- 
ficiency virus (HIV) fusion inhibitor pep- 
tide to prevent HIV infection [105], and 
interleukin-2 for immunotherapy [106]. 
Although these studies demonstrated an 
effective expression of the therapeuti- 
cally relevant molecules, the addition 
of synthetic circuits could allow these 


therapies to be more precisely tuned to a 
patient’s physiology. For example, in order 
to provide an effective cell therapy, gene 
expression could be turned on only when 
the prescribed molecular interventions are 
needed. This increased control would re- 
duce the metabolic load on the bacteria 
and increase their ability to assimilate into 
the microbiome. 

Another approach towards using en- 
gineered bacteria to treat disease has 
been explored by synthetic biologists who 
have created bacteria which seek and 
destroy cancer cells. Although it is not 
entirely clear how these engineered mi- 
crobes should come into contact with 
tumors occurring distal to host locations 
where microbes normally reside, these en- 
gineered microbes employ an interesting 
approach to cancer treatment. In one such 
study, Voigt and colleagues created bac- 
teria that invaded cancer cells only in 
the hypoxic environments that frequently 
are indicative of tumor tissue, as well as 
bacteria that used quorum-sensing com- 
ponents from other species to amplify 
their invasion response [3]. As shown in 
Fig. 6a, E. coli were engineered to invade by 
causing them to express invasin (inv), an 
adhesion protein from Yersinia pseudotuberculosis,
which tightly binds mammalian
β1 integrin receptors and induces uptake.
Invasin expression was driven by
the LuxI/LuxR quorum-sensing genes, al- 
lowing E. coli to produce an autoinducer 
that amplified invasin expression as the 
colony grew. Of course, the use of bac- 
teria to treat cancer might be associated 
with a risk of infection from the bacteria 
themselves. 

However, in another study, Li et al. 
delivered (intravenously) an engineered, 
cancer-invading bacterium to target a tu- 
morigenic pathway in vivo [107]. For this, 
RNAi was used to create bacterial invaders
that knocked down the expression of
CTNNB1 (encoding catenin β-1), a gene
that initiates many colon cancers when it
is overexpressed or undergoes oncogenic
mutation (Fig. 6b). The engineered
bacteria generated shRNA segments that 
bound to the CTNNB1 mRNA transcripts 
and induced mRNA cleavage. In addition
to shRNA and invasin, the synthetic
system also produced listeriolysin O (encoded
by the hlyA gene), which enabled
molecular transport out of the vesicles,
potentially via entry vesicle disruption.
Subsequently, when the engineered bac- 
teria were delivered intravenously into 
immunodeficient mice that had been 
xenografted subcutaneously with human 
colon cancer cells, a significant knock- 
down of the gene was observed in the 
tumor cells. 


4.2 
Optimizing Immunotherapy with Synthetic 
Biology 


While synthetic biology can be used to 
optimize the body’s microbial inhabi- 
tants, its most powerful applications in 
cell therapy could emerge as an abil- 
ity to directly alter the human host’s 
cells. As discussed earlier, HSCT and 
engineered immunotherapy are already 
broadly used approaches that could ben- 
efit from synthetic biology’s components 
and circuits. From one perspective, en- 
gineered immunotherapy already takes 
advantage of synthetic components by ex- 
pressing synthetic genes such as CAR [28, 
29]. However, rather than constitutively 
expressing engineered components, syn- 
thetic biology’s synthetic circuits could 
control the activation of cancer-targeting 
capabilities, limiting their activity to oc- 
casions when they are most necessary. 
By engineering HSC grafts and T-cell 
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Fig. 6 Bacteria can be synthetically 
programmed to invade cancer. (a) A 
population-density-dependent invasion of can- 
cer cells can be created in bacteria. E. coli 

was engineered to express invasin (inv) at
high population densities by borrowing the
LuxI/LuxR quorum-sensing apparatus. Invasin
binds β1-integrin receptors and induces uptake


grafts, these approaches could be useful in 
limiting graft-versus-host complications, 
maximizing graft-versus-tumor benefits, 
and limiting cytokine responses. 

As a step toward these goals, Smolke and
colleagues recently constructed a synthetic
RNA-based device that regulated T-cell
proliferation. As
shown in Fig. 7a, programming cells with 
this device allowed the execution of so- 
phisticated processes upon implantation. 




by targeted cells; (b) Intravenously delivered 
engineered bacteria were created that express 
invasin while also suppressing oncogene expression.
By expressing a catenin β-1-specific
short hairpin RNA (shRNA), oncogene ex- 
pression was knocked down as listeriolysin 
(LLO) expression allowed escape from the 
phagosome. 


This device differs significantly from the 
synthetic transcriptional control compo- 
nents discussed earlier. Rather, by linking 
drug-responsive, ribozyme-based regula- 
tory devices to genes targeted by growth 
cytokines, the investigators were able to 
control mouse and primary human T-cell 
proliferation [5]. The same group also 
demonstrated the ability of these syn- 
thetic controllers to modulate the T-cell 
growth rate in response to drug input 
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(a) A synthetic RNA device that controls T-cell proliferation 
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(b) Pausing T-cell activation with engineered bacterial effectors 
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Fig. 7. T-cell behavior can be regulated us- 
ing synthetic biology. (a) A T-cell proliferation 
regulatory system was engineered based on 

a programmable, drug-mediated synthetic ri- 
bozyme switch. A self-cleaving ribozyme was 
fused to a small-molecule-binding aptamer. 
The synthetic component was introduced into 
the 3’ untranslated region (UTR) of an mRNA 
transcript for a cytokine-targeted gene. In the 
absence of an inducer, the ribozyme under- 
goes self-cleavage, eliminating the poly(A) tail 
(pA) from the transcript’s open reading frame, 


in vivo. Ultimately, this process could be 
customized for individual patients, and 
used to regulate the expansion of T-cell 
grafts upon implantation. 
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thus disrupting proper translation. However, in 
the presence of an inducer, the aptamer un- 
dergoes a conformational change, which inac- 
tivates the ribozyme and allows translation to 
occur; (b) Engineered bacterial effectors, OspF 
and YopH, can inhibit specific steps in the 
T-cell response pathway. As a result, off-target 
effects of transplanted T cells, such as the 
cytokine storm effect in adoptive immunother- 
apy, could be paused by a synthetic switch 
based on these new synthetic components. 


As another step towards engineer- 
ing modified T-cell behavior, Lim and 
colleagues recently leveraged engineered 
effector proteins from bacteria to rewire 
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kinase-mediated responses systematically, 
both in yeast and mammalian immune 
cells [108]. Lim et al. recognized that bac- 
terial effector proteins can be directed 
to inhibit specific mitogen-activated pro- 
tein kinase pathways in yeast by arti- 
ficially targeting engineered effectors to 
pathway-specific complexes. Ultimately, 
they were able to use the new effector 
proteins to build circuits with their previ- 
ous protein scaffold technology and rewire 
cell behavior. Furthermore, as shown in 
Fig. 7b, they were able to use the expression
of these new effectors to tune
the T-cell receptor (TCR) pathway and
to pause T-cell activation. This approach 
could be used as a potential switch to 
limit deleterious cytokine responses and 
abrogate the need for steroid therapy fol- 
lowing engineered immunotherapy. In the 
future, both of these approaches could 
be engineered into T-cell transplants, as 
well as HSCT, to provide a long-term, 
drug-inducible treatment for cancer that 
could be reactivated in cases of partial 
remission. 


4.3 
Deploying Synthetic Circuits in Mammalian 
Cells to Treat Broad Diseases 


Whilst the enhancement of HSCT and 
engineered immunotherapy are clearly 
potential benefits of applying synthetic 
biology to cell therapy, a broad range 
of diseases might also be treated by 
transplanted human cells with synthetic 
circuits. For example, by creating synthetic 
circuits and capabilities that allow human 
cells to produce beneficial biomolecules or 
to process deleterious substances, diseases 
beyond cancer could be treated. Moreover, 
these cells could ultimately be combined 
into new engineered tissues that might be 
used to treat different aspects of a disease. 


Two recent examples from Fussenegger 
and colleagues have revealed how other 
diseases could be treated. First, the group 
recently designed a synthetic mammalian 
gene circuit to regulate uric acid home- 
ostasis in vivo. Disturbances in uric acid 
are associated with tumor lysis syndrome 
and gout [109]. In response to these issues,
the investigators created the first
implantable closed-loop synthetic circuit to
treat these conditions. As shown in Fig. 8, their
synthetic circuit sensed uric acid using 
an engineered repressor that was induced 
by uric acid. Upon de-repression, the cir- 
cuit expressed an engineered urate oxi- 
dase, allowing the elimination of uric acid. 
When cells that expressed the synthetic 
circuit were implanted subcutaneously in 
urate oxidase-deficient, transgenic mice,
urate concentrations decreased to subpathological
levels, and uric acid crystal deposition
in the kidneys was reduced.
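
The closed-loop logic of such a circuit can be sketched as a two-variable model. The code below is an illustration only, not the investigators' model: it assumes a constant urate production rate, first-order basal clearance, Hill-type induction of urate oxidase by urate (standing in for de-repression), and enzyme-dependent urate removal. All rate constants are arbitrary assumptions.

```python
def simulate_urate(with_circuit, t_end=200.0, dt=0.01,
                   production=1.0, clearance=0.1,
                   beta=1.0, K=5.0, n=2.0, gamma=0.5, k_enzyme=0.5,
                   urate0=10.0):
    """Euler-integrate serum urate with or without the sensor-effector circuit.

    d(enzyme)/dt = beta * urate^n / (K^n + urate^n) - gamma * enzyme
    d(urate)/dt  = production - clearance * urate - k_enzyme * enzyme * urate

    Returns the final (urate, enzyme) levels in arbitrary units.
    """
    urate, enzyme = urate0, 0.0
    for _ in range(int(round(t_end / dt))):
        if with_circuit:
            induction = urate ** n / (K ** n + urate ** n)
        else:
            induction = 0.0  # no implanted cells: no urate oxidase is made
        d_enzyme = beta * induction - gamma * enzyme
        d_urate = production - clearance * urate - k_enzyme * enzyme * urate
        enzyme += dt * d_enzyme
        urate += dt * d_urate
    return urate, enzyme


if __name__ == "__main__":
    print("without circuit:", simulate_urate(False))
    print("with circuit:   ", simulate_urate(True))
```

Because enzyme expression falls as urate falls, the loop in this sketch settles at a lower urate level rather than removing urate entirely, which is the defining feature of closed-loop control.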

In a second study, Fussenegger et al. 
controlled blood glucose levels by linking 
engineered melanopsin signal transduc- 
tion to a synthetic nuclear factor of ac- 
tivated T cells (NFAT). In this way, they 
were able to create a synthetic signaling 
cascade by using light-inducible transgene 
expression in cells implanted into mice. 
In animals with subcutaneous implants, 
serum levels of the human glycoprotein se- 
creted alkaline phosphatase (SEAP) could 
be transdermally regulated by direct illu- 
mination. Furthermore, when light was 
used to control the expression of the 
glucagon-like peptide 1, it was possible 
to attenuate glycemic excursions in mice 
with type II diabetes [6]. 

Ideally, these approaches could be deployed
in any cell of the body. Moreover,
with the development of induced pluripotent
stem cells (iPSCs), many new cell
therapies may be possible. These cells
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Processing uric acid with a synthetic circuit 


Fig. 8 Therapeutic networks can be de- 
ployed in cells to regulate disease. A pros- 
thetic network for the treatment of tumor ly- 
sis syndrome and gout was engineered into 
implantable sensor-effector cells. The cells 
monitored serum urate levels constantly by 
importing urate via a transgenic human uric 


can be created from an adult patient’s 
cells by engineering the expression of 
only four genes (e.g., KLF4, c-MYC, OCT3/4,
and SOX2 (KMOS)) [110]. Unfortunately,
although this method is seen as
a significant breakthrough, it has several 
drawbacks [111], notably as a result of 
the need for viruses to introduce extra 
copies of the genes permanently into the 
cellular genome, which causes cells to be- 
come prone to tumor formation. Rossi and 
colleagues recently addressed this prob- 
lem by chemically transfecting cells with 
synthetic, modified RNA molecules that 
would function as mRNA transcripts for 
the key genes [112]. These different tech- 
nologies point towards a future where 
it will be possible to develop synthetic 
circuits that regulate the genes that are 
critical to diseases, to upload these into 
patient-specific pluripotent stem cells, and 




acid transporter (URAT1). Its presence in- 
duced an engineered repressor, allowing the 
expression of a urate oxidase (smUOX). Urate 
oxidase then converted urate into allantoin, al- 
lowing the lowering of uric acid levels in live 
mice. 


to engineer new tissues that can be im- 
planted as treatments. 


4.4 
Artificial Cell Nanofactories as Therapeutics 


Moving beyond the traditional concept of a 
cell, artificial cellular nanosystems present 
a unique opportunity for developing cellu- 
lar therapies enabled by synthetic biology. 
As noted previously, drugs encapsulated 
in liposomal constructs can be viewed 
as cellular systems from a biomimetic 
perspective. Furthermore, several inves- 
tigators have proposed that engineering 
these encapsulations with more complex 
internal machinery could yield nanofac- 
tories that produce or process critical 
molecules in vivo [66, 67], in a similar 
manner as the synthetically engineered 
mammalian cells that process uric acid 
and glucose. Just as in other fields of 
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synthetic biology, intricate studies have 
been performed to understand the engi- 
neering concepts underlying the design of 
these systems. For example, the laborato- 
ries of Schwartz and LeDuc recently exam- 
ined the effect of molecular crowding on 
synthetic circuit behavior. These authors 
expanded upon their previous mathemati- 
cal model of molecular crowding [113], and 
created several cell-free synthetic circuits. 
In living cells, macromolecular crowding 
can dramatically influence biochemical ki- 
netics via volume-exclusion effects, which 
reduce diffusion rates, while enhancing 
the binding rates of macromolecules. As 
shown in Fig. 9a, the size of the crowd- 
ing molecule can be manipulated to alter 
the behavior of different synthetic circuits, 
producing negative feedback and also al- 
lowing the circuit to be finely tuned [65]. 
These approaches will be critical in build- 
ing cellular nanofactories as biomimetic 
cell therapies. 

As a therapeutic example, Mas- 
trobattista and colleagues recently 
encapsulated a reconstituted bacterial 
transcription-and-translation network 
along with DNA encoding a model 
antigen (β-galactosidase) [114]. Their
system (see Fig. 9b) was shown to 
produce a functional antigen in vitro, 
and was then deployed in mice, where 
antigen-expressing liposomes were able 
to generate a higher humoral immune 
response in comparison to control 
vaccines that consisted of liposomes 
encapsulating only the antigen, the 
transcription-and-translation network, or the
DNA template, respectively. The system 
can be easily altered for other antigens 
by changing the DNA template and, as 
a result, could prove useful in creating 
safer vaccines with lower risks than those 
based on attenuated, live virus. When 
constructing these types of encapsulated 


transcription-and-translation circuits, the 
liposomal reaction environment must be 
precisely controlled, as shown in another 
recent study [65]. 

When using liposomal encapsulations 
such as those described here, the poten- 
tial for manipulating their surface prop- 
erties could provide another therapeutic 
advantage. For example, by adding simple 
components to the lipid membrane, prop- 
erties such as size, charge and chemical 
function could easily be modified. These 
types of surface modifications could also 
be tailored for therapeutic purposes [62, 
114]. 


5 
Safety 


Particularly because these systems will 
be deployed in the body, synthetic con- 
structs could carry significant risk. As a 
result, the previously described efforts in 
inducing pluripotency [112] or vaccinat- 
ing [114] without viral vectors are critical. 
Furthermore, the long-term pharmacolog- 
ically regulated expression of genes (e.g., 
therapeutic genes) is key to the success 
of many of the cell-based therapies en- 
visioned here. As such, it is critical that 
the regulatory system — that is, the circuit 
or transgene expression system — is main- 
tained in human cells. Notably, many of 
the engineered transcription factors that 
are bacterial in origin can be immuno- 
genic in a clinical setting, as shown by 
Rivera et al. [115]. To combat potential im- 
munogenic responses, recent advances in 
synthetic biology have included significant 
developments in engineered eukaryotic 
proteins and domains, as opposed to bacterial
ones [6, 109, 116]. For example, Khalil
et al. recently described a system using en- 
gineered zinc-fingers to create broad new 
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Fig. 9 Approaches for providing biomimetic 
cell therapy through engineered artificial cell 
nanofactories. (a) Nanofactories consisting 
of liposome-encapsulated DNA templates, 
RNA polymerase and ribosomes can be en- 
gineered to produce key outputs. Gene ex- 
pression in these systems can be regulated 
by crowding the system with large and small 
molecules. These added crowding molecules 
limit diffusion, while enhancing the binding 
kinetics. As a result, gene circuits can be 
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tuned by their addition, such as in regulat- 
ing a negative feedback circuit; (b) A synthetic 
nanofactory can be developed for vaccina- 
tion. Immunostimulatory, antigen-expressing 
nanofactories were constructed to produce
a model antigen in vivo for vaccination.
In live mice, antigen-expressing
liposomes generated a higher humoral 
immune response compared to control 
vaccines. 
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classes of eukaryotic transcriptional acti- 
vators that can be combined to form logic 
gates and, potentially, many other circuit 
types. 

Additionally, in any of these synthetic efforts,
it is easy to imagine synthetic circuits
that could malfunction with disastrous effects.
For example, if a proliferation circuit
were to malfunction, then the unchecked 
cell proliferation could overwhelm a pa- 
tient. As a result, Collins and colleagues 
recently created a synthetic kill-switch in 
bacteria that parses several different inputs 
to activate riboregulators that control the 
expression of a lethal gene [74]. The system
could potentially be integrated into more 
complex circuits, activating when aberrant 
circuit behavior is detected. Along these 
lines, synthetic classifiers could also be- 
come important parts of therapies. For 
example, Benenson, Weiss and colleagues 
recently described a synthetic logic circuit 
based on RNAi for the classification of specific
cancer types [84]. These types of logical
classifiers could be deployed throughout
cell therapies to determine when the en- 
grafted cells had malfunctioned and, upon 
making that determination, they could ac- 
tivate synthetic kill-switches. As a result, 
much of synthetic biology’s future contri- 
butions to cell therapy may involve enhanc- 
ing the safety of already-existing therapies, 
allowing them to be more broadly pre- 
scribed. Finally, the ability to safely and 
predictably alter cellular behaviors could 
revolutionize the pharmaceutical industry 
[117]. 


6 
Concluding Remarks 


Although cell therapy is a widely used 
medical intervention which has achieved 
many successes, it has the potential to 


be optimized using synthetic biology’s 
constructs and circuits. While approaches 
that have leveraged synthetic biology to 
engineer therapies show promise, many 
technical challenges still need to be over- 
come. As an example, new approaches will 
require the design of toolkits of compo- 
nents and regulatory modules that can act 
predictably in vivo [117]. The approaches 
and discoveries described in this chapter 
demonstrate the field’s immense poten- 
tial, while illustrating some of its obvious 
limits. Synthetic circuits provide particular 
promise in engineering effective cell thera- 
pies because they allow classical engineer- 
ing control modules and functions to be 
deployed in the tuning of cell phenotypes. 
Ultimately, synthetic biology approaches 
have rapidly reconfigured the genetic en- 
gineering landscape over the past decade, 
and with so many groups bringing engi- 
neering principles to bear on biological 
networks, it is clear that synthetic biology 
will continue its expansion of cell therapy 
techniques. 
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Keywords 


Codon usage (also referred to frequently as “codon bias”)

The genetic code is degenerate — a particular amino acid can be encoded by more than 
one codon. Organisms exhibit a preference to use particular synonymous codons in 
higher frequency than would be expected by chance. 


Codon pair bias (CPB) 

There are 3721 possible codon pairings, or pairs of adjacent codons. (For an encoding of the
dipeptide A1A2, any of the 61 non-stop codons can encode A1 and also A2; 61 × 61 = 3721.) Due to
degeneracy, each possible dipeptide (400 total) can typically be encoded by several codon pairings.
Codon pair bias is the preference of organisms to use certain codon pairings to encode
adjacent amino acids more or less often than would be expected by chance, independent of
codon usage. Codon pairs overlap. For the tripeptide X1X2X3, there will be
two codon pair encodings, one for X1X2 and another for X2X3. CPB is calculated 5’ to 3’
and is generally not commutative: X1X2 does not usually have the same CPB as X2X1.
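
The counts quoted in this definition can be checked with a short script. The sketch below assumes the standard genetic code; the names CODON_TABLE, pairings_for, and codon_pairs are introduced here purely for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Non-stop (sense) codons, all ordered codon pairs, and the dipeptides they encode.
SENSE_CODONS = [codon for codon, aa in CODON_TABLE.items() if aa != "*"]
CODON_PAIRS = list(product(SENSE_CODONS, repeat=2))
DIPEPTIDES = {(CODON_TABLE[a], CODON_TABLE[b]) for a, b in CODON_PAIRS}


def pairings_for(aa1, aa2):
    """Number of codon pairs that encode the dipeptide aa1-aa2."""
    return sum(
        1 for a, b in CODON_PAIRS
        if CODON_TABLE[a] == aa1 and CODON_TABLE[b] == aa2
    )


def codon_pairs(orf):
    """Overlapping adjacent codon pairs of an open reading frame, read 5' to 3'."""
    usable = len(orf) - len(orf) % 3
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    return list(zip(codons, codons[1:]))


if __name__ == "__main__":
    print(len(SENSE_CODONS), len(CODON_PAIRS), len(DIPEPTIDES))
    print(pairings_for("L", "S"), pairings_for("M", "W"))
    print(codon_pairs("ATGGCTAAA"))
```

The number of pairings varies widely between dipeptides: Leu-Ser has 6 × 6 = 36 possible encodings, whereas Met-Trp has only one, so codon pair bias can only arise where more than one encoding is available.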


Shuffling/scrambling 

During the computer-aided recoding process, the specific use of each synonymous codon 
encoding a particular amino acid is maintained. However, in shuffling or scrambling, 
the position of those synonymous codons within the genome has been altered. In effect, 
this is a method to generate different codon pairings without introducing any change in 
the overall codon usage. 
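
A minimal sketch of the shuffling idea follows. It is not the recoding software used by the authors; it simply permutes synonymous codons among the positions that encode the same amino acid, assuming the standard genetic code, so that the protein and the overall codon usage are unchanged while the codon pairings differ. The function names and the example sequence are hypothetical.

```python
import random
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(orf):
    """Split an in-frame sequence into its codons."""
    return [orf[i:i + 3] for i in range(0, len(orf), 3)]


def translate(orf):
    """Translate an in-frame sequence ('*' marks a stop codon)."""
    return "".join(CODON_TABLE[codon] for codon in split_codons(orf))


def shuffle_synonymous(orf, seed=0):
    """Permute codons among positions encoding the same amino acid."""
    rng = random.Random(seed)
    codons = split_codons(orf)
    positions = defaultdict(list)
    for index, codon in enumerate(codons):
        positions[CODON_TABLE[codon]].append(index)
    shuffled = list(codons)
    for indices in positions.values():
        pool = [codons[i] for i in indices]
        rng.shuffle(pool)  # reorder the synonymous codons for this residue
        for i, codon in zip(indices, pool):
            shuffled[i] = codon
    return "".join(shuffled)


if __name__ == "__main__":
    original = "ATGCTGTTAAGCTCTCTCAGTTTGTAA"  # hypothetical example ORF
    recoded = shuffle_synonymous(original, seed=1)
    print(original, translate(original))
    print(recoded, translate(recoded))
    print(Counter(split_codons(original)) == Counter(split_codons(recoded)))
```

A practical recoding tool would additionally score each candidate arrangement against codon pair statistics for the host organism; that scoring step is omitted here because it depends on data not given in this chapter.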


Synthetic attenuated virus engineering (SAVE) 

The recoding of viral genomes so that codon pairings, which are normally 
under-represented in the genome space of a particular organism, become 
over-represented. As these unfavorable codon pairings accumulate within the genome, 
the virus becomes attenuated. The level of attenuation is tunable by adjusting the 
number of unfavorable codon pairings. 


Deoptimization 
Recoding of a genome so that under-represented codon pairings are preferentially used.


Several approaches to vaccine design exist, including inactivated and live attenuated 
vaccines. Live attenuated vaccines tend to promote longlasting immunity by 
activating both cellular and antibody responses. By taking advantage of technologies 
developed in synthetic biology, attenuated polioviruses and influenza viruses have 
been synthesized and shown to protect susceptible mice from infection. The live 
attenuation of these viruses was achieved by a novel approach to vaccine design, 
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synthetic attenuated virus engineering (SAVE), in which changes to the preferential 
usage of certain codons and codon pairs to encode amino acids and adjacent amino 
acid pairs have been incorporated into the virus during synthesis. By choosing 
the extent of under-represented codons or codon pairs incorporated (which also 
results in a concomitant increase of rare CpG and UpA dinucleotide frequencies), 
viruses can be attenuated in both cell culture and mouse models in a tunable 
fashion. This approach has been extended to HIV retroviruses and the bacterium
Streptococcus pneumoniae. With the increased pace of advance in vaccine designs, a
view that includes disease eradication rather than just control is advocated. With 
initial positive results in viral attenuation using SAVE, investigations in other virus 
families may show SAVE to have a broad applicability. 


1 
Introduction 


Vaccines have been invaluable in pro- 
tecting human health against infectious 
disease. Several types of vaccines exist, 
including inactivated vaccines, live atten- 
uated vaccines (LAVs), subunit vaccines, 
toxoid vaccines, and DNA vaccines. Here, 
attention will be focused mostly on new 
synthetic biology methods of producing 
live attenuated viral vaccines, while infor- 
mation relating to bacterial vaccines [1, 
2], antifungal vaccines [3] and vaccines 
against parasitic diseases such as malaria 
[4, 5] is available elsewhere. 

A live attenuated viral vaccine consists 
of a live, replication-competent virus that 
is genetically weakened in some sense so 
that it is incapable of causing disease. The 
virus can grow in a host, and thus can 
stimulate powerful, longlasting, cellular 
and antibody immune responses provid- 
ing immunity against the wild-type (wt) 
virus. The key to a safe and efficacious 
LAV is achieving a balance between repli- 
cation on the one hand, and attenuation on 
the other hand, so that a strong immune 
response is provoked without causing dis- 
ease. For some of the most successful 
LAVs, the virus is attenuated in biologically 
special ways that preferentially diminish 


the ability to cause disease, without greatly 
affecting the ability to replicate. Several 
such examples are discussed below. 

LAVs have a number of advantages over 
other common types of vaccine, such as 
inactivated vaccines. First, they provoke 
both cellular (e.g., T cell) and humoral 
antibody responses, whereas inactivated 
vaccines tend to provoke mainly antibody 
responses. Second, as they are typically 
very similar to the pathogenic virus, the 
responses of LAVs are relatively strong and 
longlasting, often life-long. Third, LAVs 
can be effective in very small amounts, 
and can also be very inexpensive to 
manufacture. 

However, LAVs also have disadvantages, 
the first problem being that some attenu- 
ated viruses are weakened due to a small 
number of mutations, and so they can re- 
vert to a pathogenic form at a low, but 
significant, rate. This has been a problem 
with live attenuated poliovirus vaccines, 
for instance [6, 7]. Second, a live attenuated 
virus might be pathogenic in an individ- 
ual with a compromised immune system. 
Worse, the vaccine virus may engage in a 
chronic infection of the vaccine recipient 
with devastating consequences. This too 
has been a problem with the live attenu- 
ated poliovirus vaccines. Third, traditional 
methods of making live attenuated viruses 
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and achieving the right balance between 
replication and attenuation are empirical, 
trial-and-error methods, which can be slow 
and do not always yield the ideal balance. 
The first and third of these three disadvan- 
tages can largely be addressed by modern 
synthetic biology methods of vaccine de- 
sign, as discussed below. 

The original live attenuated viral vaccine 
was Edward Jenner’s cowpox vaccine of 
1797, used to protect against smallpox. In 
this case, the “trick” allowing a favorable 
balance between replication and attenua- 
tion was to use a virus of a non-deadly 
poxvirus species (cowpox) to protect 
against a different, deadly human poxvirus 
species (smallpox). Interestingly, although 
Jenner’s work on cowpox and smallpox 
arguably underlies the whole field of im- 
munology, and directly or indirectly saved 
hundreds of millions of lives, Jenner’s 
original manuscript describing his obser- 
vations and submitted to the Royal Society 
for publication, was rejected! [8, 9]. 

A second highly successful LAV was that 
developed against yellow fever in 1937 by 
Max Theiler (later awarded a Nobel Prize). 
By 1927, a team from the Rockefeller 
Foundation had found that African yellow 
fever could be transmitted to Indian 
rhesus monkeys (three members of the 
African field team making this discovery 
contracted yellow fever and died). Theiler 
was able to further propagate the virus first 
in mouse brains, then in chick embryos. 
(During this work, Theiler also contracted 
yellow fever, but survived; by 1931 there 
were 32 known laboratory cases of yellow 
fever, with five deaths [10].) After a total of 
176 passages in chick embryos, the yellow 
fever virus strain called 17D had arisen 
by mutation and adaptation to the chick 
cells, and simultaneously lost virulence in 
humans. This strain was used as a LAV. 
Between January 1941 and April 1942, 


the vaccine was given to seven million 
US army recruits, without inducing any 
known case of yellow fever. 

The success of the yellow fever vaccine 
showed that attenuation could be achieved 
by passaging a pathogenic virus in some 
other cell type (often, a nonhuman cell) 
or under some other condition. The in- 
troduction of tissue culture [11] and viral 
quantification by plaque assay [12], com- 
bined with the idea of passaging virus 
in an alternative host cell, quickly led to 
new LAVs for the treatment of polio [13], 
measles [14], mumps [15], and rubella [16]. 

Sabin’s live attenuated poliovirus vac- 
cine is especially well known. Polioviruses 
types 1, 2, and 3 were passaged in monkey 
kidney cells at subphysiological temper- 
atures, eventually generating attenuated 
variants. The type 1 strain had 57 nu- 
cleotide substitutions compared to the 
progenitor (virulent) strain, while types 
2 and 3 had only 2 and 10 substitutions, 
respectively. Only a few mutations that ac- 
cumulated during the numerous passages 
proved essential in attenuating neuroviru- 
lence. However, a key and primary atten- 
uating mutation in each serotype appears 
to be a mutation in the viral internal ri- 
bosome entry site (IRES). This mutation 
is thought to mitigate protein translation 
in most cells of the organism, including 
those of the nervous system but, critically, 
has only a minimal negative effect and 
thus allows translation, in the cells of the 
gut. Consequently, these polio strains are a 
type of host-range mutant — they replicate 
efficiently in the gut (where they provoke 
an immune response but cause little, if 
any, disease), yet they are inhibited from 
replicating in neural tissue (where they 
would cause disease). It is surprising that 
the Sabin vaccine strains, carrying only
very few attenuating mutations, have been 
so successful in curtailing the incidence 




of poliomyelitis worldwide. The lucky fact 
is that the target tissue of poliovirus repli- 
cation is not the central nervous system 
(CNS) but the gastrointestinal tract. In- 
fection of the CNS is an accident and 
of no advantage for poliovirus; it occurs 
in the case of polioviruses types 2 and 3 
only once in every 2000 infections of naive 
humans. Nevertheless, the very few mutations,
although lowering the probability of neuronal
complications, put these viruses at risk of
reversion to virulence, which does indeed
occur [7]. Moreover, because poliomyelitis is
an irreversible, disastrous disease, even a
low rate of vaccine-associated paralytic
disease has become a large health burden. A second
major problem with the live poliovirus 
vaccines emerged when it was discovered 
that they have the propensity to genetically 
recombine with common yet benign hu- 
man enteroviruses, the C-cluster coxsackie 
viruses. Both wt poliovirus and these coxsackie
viruses belong to human enterovirus species C
(HEV-C), and both can recombine with the vaccine
strains. The poliovirus/coxsackie virus
recombinants are sometimes highly neurovirulent
[17] and can spread as circulating vaccine-derived
poliovirus (cVDPV) in poorly immunized
communities, thereby causing outbreaks
of poliomyelitis [18-21].

A final example is the recently li- 
censed live attenuated influenza vaccine, 
FluMist®. This virus was grown for 32 
passages in chicken kidney cultures at 
25°C to produce a temperature-sensitive/ 
cold-adapted virus, which grows well at 
25 °C but very poorly at 37 °C. It is thought
that this virus is, like the polio vac- 
cine, a type of host-range mutant — it may 
grow relatively well in the nasal passages 
where temperatures are slightly lower than 
37°C, but poorly in the lungs where the 
temperatures are higher. Growth in the 


nasal passages provokes immunity, while 
a lack of growth in the lungs prevents 
respiratory disease. There have been fears 
that live attenuated flu vaccines with seg- 
mented genomes could engage in genetic 
reassortment with virulent flu strains, but 
so far these fears have not been justified. 
Indeed, “... after nearly a decade of using 
the licensed live vaccine in the field, such 
worrying reassortant viruses with virulent 
phenotypes have simply not been found” 
(R.B. Belshe, personal communication). 
The live attenuated viruses discussed 
above were a result of empirical, trial-and- 
error attenuation by repeated passaging in 
nonhuman organisms and cultured cells 
under various conditions, including un- 
usual temperatures. However, this strat- 
egy does not always result in a workable 
vaccine. An acceptable balance between 
replication and attenuation may not occur 
[22], and even if it does some attenuated 
viruses can too easily revert to a virulent 
form through mutation [18] or recombina- 
tion [6, 19, 20]. Historically, the attenuation 
of virulence has not been achieved by ra- 
tional design; rather, achieving the correct 
degree of attenuation, and a small prob- 
ability of reversion, has been a matter of 
chance. Modern methods that do use ra- 
tional design should be able to do better. 


2 
Synthetic Approaches to Vaccine Design 


As highlighted by the preceding discus- 
sion, an ideal method for virus attenuation 
would be tunable, so that any desired de- 
gree of attenuation could be achieved, and 
also be non-revertable. It is suggested that 
both goals can be achieved simultaneously 
by introducing a very large number of 
mutations (hundreds to thousands), each 
of which can potentially cause only a tiny 
defect. Because the defects are
very small, any particular reversion would
not bestow on the revertant sufficient fitness
for selection; moreover, because there
are hundreds of defects, reversion to virulence
would be so improbable as to not
be a practical concern. In addition, any
degree of attenuation could be “tailored”
by introducing more or fewer deoptimized
codon pairs.

In principle, synthetic biology and gene 
synthesis provide several ways of implementing
this “death by a thousand cuts” [23] strategy.
Since demonstrating
the chemical synthesis of poliovirus with- 
out natural template in their laboratory 
in 2002 [24], the present authors have in- 
vestigated the use of synthetic, recoded 
viruses for the development of LAVs. As 
gene synthesis has become more afford- 
able, from US$12 per base pair (bp) in 
2000 [25] to US$0.28 per bp in 2013, the 
cost of synthesizing whole viral genomes 
has decreased dramatically, so that recod- 
ing a whole virus genome according to 
some theoretical design principle is now 
feasible. 
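
To put these prices in perspective, a rough calculation can be made for a genome the size of poliovirus. The sketch below assumes the 7513-nucleotide genome length quoted in the caption of Fig. 2 and simply multiplies it by the per-base-pair prices given above; the names are illustrative, and real synthesis costs would also depend on vendor, fragment length, and sequence complexity.

```python
GENOME_LENGTH_BP = 7513                     # poliovirus length as quoted in Fig. 2
PRICE_PER_BP = {2000: 12.00, 2013: 0.28}    # US$ per bp, taken from the text

def synthesis_cost(length_bp, price_per_bp):
    """Naive estimate: genome length times price per base pair."""
    return round(length_bp * price_per_bp, 2)

# Estimated whole-genome cost for each year, and the fold drop in unit price.
costs = {year: synthesis_cost(GENOME_LENGTH_BP, price)
         for year, price in PRICE_PER_BP.items()}
fold_drop = PRICE_PER_BP[2000] / PRICE_PER_BP[2013]

print(costs, round(fold_drop, 1))
```

Under these assumptions the estimate falls from roughly US$90 000 to about US$2100, a drop of more than 40-fold.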


2.1 
Codon Usage Recoding 


One approach that has been developed for 
designing attenuated viruses is based on 
the phenomenon of codon usage. Most 
amino acids can be encoded by more than 
one codon (e.g., Leu, Arg, and Ser can 
each be encoded by six codons), but each 
organism has a codon usage preference; 
that is, it uses some of these synonymous 
codons more frequently than others [26, 
27]. Fig. 1 shows a pentapeptide and its 
possible synonymous encodings — 192 in 
this case—and therefore, for any given 
protein there are many ways to encode 
that protein. A large body of literature 


Gly - Ala - Met - Phe - Leu
GGA   GCA   AUG   UUC   CUA
GGC   GCC         UUU   CUC
GGG   GCG               CUG
GGU   GCU               CUU
                        UUA
                        UUG

Fig. 1 Pentapeptide degenerate encodings. Each amino acid except
methionine and tryptophan can be encoded by multiple codons. This
particular pentapeptide has a total of 4 × 4 × 1 × 2 × 6 = 192
possible encodings.


has shown that genes encoded with rarely 
used codons tend to have low levels 
of expression. Thus, one method for 
attenuating the function of a gene would 
be to create a synthetic version of that gene 
that is recoded with rarely used codons. For 
example, averaging across many different 
tissues in humans, the amino acid leucine 
is encoded by CUG 40% of the time (the 
most common codon), but by CUA for 
only 7% of the time (the rarest codon) 
[28]. Thus, a codon usage deoptimization 
strategy might involve recoding a large 
number of codons (e.g., all Leu codons to 
CUA, all Ser codons to UCG, etc.).
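
The degeneracy shown in Fig. 1 and the recoding idea described here can be sketched in a few lines of Python. The codon sets are those of the standard genetic code; the function names and the `RARE` map are illustrative assumptions, and only the Leu to CUA entry is taken from the text.

```python
from math import prod

# Synonymous codons (standard genetic code) for the residues in Fig. 1.
SYNONYMS = {
    "Gly": ["GGA", "GGC", "GGG", "GGU"],
    "Ala": ["GCA", "GCC", "GCG", "GCU"],
    "Met": ["AUG"],
    "Phe": ["UUC", "UUU"],
    "Leu": ["CUA", "CUC", "CUG", "CUU", "UUA", "UUG"],
}

# Hypothetical deoptimization map: only Leu -> CUA (7% usage) comes from the text.
RARE = {"Leu": "CUA"}

def count_encodings(peptide):
    """Number of distinct mRNA sequences that encode the peptide."""
    return prod(len(SYNONYMS[aa]) for aa in peptide)

def deoptimize(peptide, codons, rare=RARE):
    """Replace each codon by the designated rare synonym, if one is listed."""
    return [rare.get(aa, codon) for aa, codon in zip(peptide, codons)]

peptide = ["Gly", "Ala", "Met", "Phe", "Leu"]
n_encodings = count_encodings(peptide)      # 4 * 4 * 1 * 2 * 6
recoded = deoptimize(peptide, ["GGC", "GCC", "AUG", "UUC", "CUG"])

print(n_encodings, " ".join(recoded))
```

The amino acid sequence is untouched by construction, because each codon is only ever exchanged for a synonym of the same residue.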

This strategy was used by Mueller 
et al. [28] and Burns et al. [29] to at- 
tenuate poliovirus by recoding the P1 
(capsid-encoding) region of poliovirus. 
Mueller et al. [28], using PV1(Mahoney), 
introduced 680 mutations so as to direct 
the usage of rare codons for all amino 
acids, thereby generating the virus PV-AB. 
The completely deoptimized P1 region 
(yielding the capsid precursor with wt 
amino acid sequence) in the otherwise wt 
PV1(M) genome rendered the construct 
nonviable, as shown in Fig. 2. However, 
subclones containing shorter regions of 
deoptimized encodings were viable and, 
Fig. 2. Growth phenotypes of P1 region recoded polioviruses. Poliovirus is encoded by 7513 nucleotides, of which the
P1 capsid region ranges from positions 743-3385. The upper panel shows growth characteristics of codon usage re- 
coded polioviruses. P1-SD virus contains the largest possible number of codon position changes (scrambling) while 
maintaining the original codon usage. P1-AB virus contains the largest possible number of rarely used synonymous 
codons. Both, P1-SD and P1-AB are recoded polioviruses that investigate effects of changes in codon usage. P1-AB was 
found to be a nonviable virus. The lower panel shows the growth characteristics of codon pair-deoptimized polioviruses. 
P1-Max contains over-represented codon pairs. P1-MinXY contains deoptimized codon pairs encompassing nucleotides 
755-2470, while P1-MinZ contains deoptimized codon pairs encompassing nucleotides 2471-3385. P1(M)-WT is shown 
for comparison with the plaque titration at the top (*) the control for P1-SD and P1-AB and (**) the control for P1-Max, 
P1-MinXY, and P1-MinZ. The growth kinetics for all viruses is shown to the right. 
on a per-particle basis, were neuroattenuated.
As a control, Mueller et al. created
the virus PV-SD, in which synonymous
codons in the P1 region were “scrambled”
(see below), producing a virus with
wt codon usage but 934 silent nucleotide
mutations. This scrambled virus grew similarly
to wt (Fig. 2), showing that poor
codon usage, and not nucleotide rearrangement,
was the reason for the poor
growth of PV-AB and its derivatives. Similarly,
Burns et al. focused on nine amino
acids, and recoded the P1 region of the 
Sabin vaccine strain of PV2 such that, for 
these nine amino acids, only the most rare 
synonymous codons were used. Again, 
this recoding attenuated the virus, and by 
examining a number of different viruses 
with different degrees of recoding Burns 
et al. [29] showed that the degree of viral 
attenuation was proportional to the degree 
of recoding to rare codon usage. Repeated 
passage of the recoded viruses with altered 
codon use, however, yielded phenotypic re- 
vertants. More recently, Nougairede et al. 
[30] used “random codon re-encoding” on
Chikungunya virus, also achieving attenu- 
ation. To specifically address the issue of 
the rate of reversion, Bull et al. [31] atten- 
uated phage T7 by recoding to use rare 
codons, and indeed found that reversion 
was very slow (though not quite so slow as 
expected on theoretical grounds). 


2.2 
Codon-Pair Bias 


From these experiences exploring 
genome-scale manipulations of po- 
liovirus, the effects of altering codon-pair 
bias (CPB) were investigated subse- 
quently. CPB, which is different and 
independent from codon usage, is the 
bias for certain adjacent pairs of codons 
to occur more or less frequently than 


expected [32]. That is, after normalizing 
for codon usage, some codons are more 
frequently or less frequently found 
adjacent to other codons. For example, 
CUU CGA is a very common (good) 
encoding of Leu Arg, whereas CUU 
AGG is a very rare (bad) encoding. 
This phenomenon can be quantified by 
introducing the “codon pair score” (CPS):


\[
\mathrm{CPS} = \ln\left( \frac{F(AB)_o}{\dfrac{F(A) \times F(B)}{F(X) \times F(Y)} \times F(XY)} \right) \tag{1}
\]


where F denotes frequency, F(AB)o is 
the observed frequency of a particular 
codon pair, and F(XY) is the frequency 
of the corresponding amino acid pair. 
The denominator calculates the expected 
F(AB) based on the codon usage found in 
a core set of 14795 consistently annotated 
human genes [23]. The “CPB” can then be 
calculated as the arithmetic mean of the 
individual CPSs, calculated for each pair 
in the gene or genome of interest, 


\[
\mathrm{CPB} = \sum_{i=1}^{k} \frac{\mathrm{CPS}_i}{k-1} \tag{2}
\]


In Fig. 3, the CPB of each of the 14795 annotated
human genes is plotted against gene length. The
peak of the distribution lies at a positive CPB
of 0.07, which is the mean score for all annotated
human genes.
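
As a minimal sketch of how these two quantities could be computed, the Python below takes raw occurrence counts in place of frequencies (an assumption; the ratio is unchanged as long as all counts come from the same gene set). The function names and the toy counts are hypothetical and are not real human data. For a sequence of k codons there are k − 1 adjacent pairs, so the mean is taken over the list of pair scores.

```python
from math import log

def codon_pair_score(n_ab, n_a, n_b, n_x, n_y, n_xy):
    """CPS = ln(observed / expected) for codon pair AB encoding amino acid pair XY."""
    expected = (n_a * n_b) / (n_x * n_y) * n_xy
    return log(n_ab / expected)

def codon_pair_bias(pair_scores):
    """CPB: arithmetic mean of the CPS values of all adjacent codon pairs."""
    return sum(pair_scores) / len(pair_scores)

# Toy counts (hypothetical): codon A is 50 of 100 codons for amino acid X,
# codon B is 20 of 100 codons for Y, and the amino acid pair XY occurs 40 times,
# so the expected count of the codon pair AB is 4.
cps_over = codon_pair_score(n_ab=8, n_a=50, n_b=20, n_x=100, n_y=100, n_xy=40)
cps_under = codon_pair_score(n_ab=2, n_a=50, n_b=20, n_x=100, n_y=100, n_xy=40)
cpb = codon_pair_bias([cps_over, cps_under])

print(round(cps_over, 3), round(cps_under, 3), round(cpb, 3))
```

A pair seen twice as often as expected scores +ln 2, a pair seen half as often scores −ln 2, and a sequence containing one of each has a CPB of zero.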


3 
Attenuation by CPB Recoding 


By using a heuristic algorithm developed 
at Stony Brook, whole viral genomes can be 
computationally recoded to use common 
Fig. 3. Codon pair bias in humans. The codon
pair bias of each annotated human gene is 
plotted in relation to gene length (number of 
amino acids). In general, most human genes 
exhibit slight positive codon pair bias, cluster- 
ing in the CPB space from 0 to 0.2. Codon 
pair-deoptimized poliovirus constructs as 

well as wild-type and Max are plotted on the 


or rare codon pairs while retaining wt 
codon usage and amino acid sequence [23], 
and controlling the folding free energy of 
the RNA [33, 34]. The computer-designed 
viral genome is then chemically synthe- 
sized de novo, and the whole virus re- 
generated by reverse genetics in the case 
of RNA viruses. This process is termed 
“Synthetic Attenuated Virus Engineering” 
(SAVE) [23]. 


3.1 
Poliovirus 


Using SAVE, two new polioviruses
were initially designed, synthesized and
characterized. These were PV-Min (631
synonymous nucleotide changes, CPB =
−0.474) and PV-Max (566 synonymous
nucleotide changes, CPB = +0.246),
which contained codon pairs that were 
either under-represented relative to the 
human genome or over-represented, 
respectively, in the P1 region of the virus 
(2643 total nucleotides) [23] (see table in 
Fig. 3). Wild-type poliovirus, PV1(M), has 




same figure. Deoptimized polioviruses possess 
more under-represented codon pairs result- 
ing in a negative CPB ranging from —0.19 

to —0.47. The table shows the actual CPB of 
various recoded polioviruses, the number of 
nucleotide changes, and the resulting growth 
phenotype. 


a CPB that is slightly negative at —0.02. It 
should be noted that the frequencies of 
CpG and UpA nucleotide pairs were also 
increased for the deoptimized viruses, 
whereas the numbers of CpG and UpA 
dinucleotides between codons in PV-Max 
were significantly reduced compared to 
wt PV1(M) (see discussion in Sect. 4). 
The PV-Max virus was found to produce 
a 90% cytopathic effect within 24h, 
similar to results found for the wt. On 
the other hand, PV-Min produced no 
visible cytopathic effect, even after 96h, 
and no viruses were isolated from the 
supernatant after four blind passages. 
This was quite remarkable, as no rare 
codons have been introduced into these 
recoded genomes [23]. As before, the 
recoded P1 region was divided into three 
segments (X, Y, Z) to reduce the length of 
the codon pair-deoptimized region; the 
segments were then subcloned separately 
into the PV1(M) background, so as to 
produce two new viruses, PV-MinXY 
(407 synonymous nucleotide changes, 
average CPB = —0.32) and PV-MinZ (224 
synonymous nucleotide changes, average
CPB = −0.19). These viruses were viable
but severely debilitated, as gauged by 
plaque phenotypes and growth kinetics 
after 72 h incubation (see Fig. 2). The viral 
attenuation was found to be due primarily 
to a reduced specific infectivity, which 
showed remarkable differences to that of 
the wt PV1(M) [23]. 

PV-MinXY and PV-MinZ were ad- 
ministered intracerebrally into CD155 
transgenic mice to determine whether a 
protective immune response could be con- 
ferred. CD155 mice were administered a 
dose of 10° viral particles per week for 
three weeks. The PLD₅₀ (the amount of
viral particles that cause paralysis/death 
in 50% of infected mice) of wt poliovirus 
was 10* particles, and 10 days after the fi- 
nal injection a robust immune response 
was confirmed by neutralization assay. 
Challenge of the mice with lethal doses 
of wt poliovirus did not lead to death or 
paralysis. These initial findings thus sug- 
gested that codon pair deoptimization of 
poliovirus could be an effective platform 
for developing a live attenuated virus. As 
reversion to virulence is a primary con- 
cern with live attenuated viruses, a check 
was also made for sequence or phenotypic 
reversion after passages in tissue culture 
cells or in mice. Reversion to a virulent 
phenotype was not found in either case. 
Although the approach of making hundreds of
mutations of small effect has been referred to as
“death by a thousand cuts” [23], ongoing findings
suggest that it is unlikely that each nucleotide
change has the same detrimental effect on overall
viral fitness.
Interestingly, it was found that PV-Max did 
not produce a super-virulent poliovirus, 
but rather mimicked the wt. Competition 
assays between PV-Max and wt performed 
in the present authors’ laboratory did not 


result in a clear winner after six pas- 
sages (O. Gorbatsevych and M. Arabov, 
unpublished results), similar to findings in 
another laboratory using the same recoded 
viruses [35]. Although PV-Max might be
expected to outcompete the wt virus, it
was speculated that a more positive CPS
(codon pair optimization) brings little
additional benefit, either because the wt
virus has already reached maximal fitness
within a particular cellular system, so that
PV-Max cannot replicate any faster, or
because optimization above a certain
extent (e.g., CPS >0.0) does not have
very much effect on replication or translation.


3.2 
Scrambled Poliovirus 


Interestingly, the SAVE platform was also 
used to discover previously unknown sec- 
ondary structures of the poliovirus RNA, 
using a “scrambled design” or shuffling 
[36]. In the scrambling method, the wt 
codons of the whole viral genome are 
used, but as many synonymous codons are 
scrambled (shuffled, see Keywords) around 
the genome as possible (preserving both 
codon usage and amino acid sequence) giv- 
ing rise to low, high, or wt CPB scores, as 
desired. Currently, the terms “scramble”
and “shuffle” are used interchangeably.
As an example of shuffling, two potential
codings for the pentapeptide LRLSR
are shown in Fig. 4. The top encoding,
UUG AGG CUU UCG CGU, uses more
common codon pairs, resulting in a summed
CPS of +0.33. Without changing the codons
used or the amino acid sequence, the first 
and third codons are swapped, resulting in 
the shuffled encoding, CUU AGG UUG 
UCG CGU. As a result of this shuffle by 
Fig. 4. Example of shuffling/scrambling. Shuffling
or scrambling changes the position of
the original codons used in the pentapep- 
tide, LRLSR. Here, the codons both express- 
ing leucine are shuffled. While preserving 

the codon usage and amino acid sequence, 


one codon, more under-represented codon
pairs occur, which has a drastic effect on
the summed CPS, decreasing it to −1.79. In this
particular case, shuffling decreased the
CPB score but, if desired, shuffling can
also increase the CPB or leave it at wt
levels.
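
The bookkeeping behind this example can be sketched in Python. The pair scores are those printed in Fig. 4; treating pairs that the figure does not list as neutral (0.0) and the simple greedy swap search are assumptions made for illustration, and are not the heuristic algorithm used for SAVE.

```python
import random
from collections import Counter

# Amino acid encoded by each codon used in the LRLSR example.
AA = {"UUG": "L", "CUU": "L", "AGG": "R", "CGU": "R", "UCG": "S"}

# Codon pair scores as printed in Fig. 4; unlisted pairs default to 0.0 (assumption).
PAIR_SCORE = {
    ("UUG", "AGG"): 0.10, ("AGG", "CUU"): 0.17, ("CUU", "UCG"): 0.40,
    ("UCG", "CGU"): -0.34, ("CUU", "AGG"): -0.82, ("AGG", "UUG"): -0.15,
    ("UUG", "UCG"): -0.48,
}

def summed_cps(codons):
    """Sum of the pair scores over all adjacent codon pairs."""
    return sum(PAIR_SCORE.get(pair, 0.0) for pair in zip(codons, codons[1:]))

def protein(codons):
    """One-letter amino acid sequence encoded by the codon list."""
    return "".join(AA[c] for c in codons)

def greedy_deoptimize(codons, iterations=200, seed=0):
    """Swap synonymous codons at random, keeping only swaps that lower the score."""
    rng = random.Random(seed)
    best = list(codons)
    for _ in range(iterations):
        i, j = rng.sample(range(len(best)), 2)
        if AA[best[i]] != AA[best[j]] or best[i] == best[j]:
            continue  # only distinct synonymous codons may trade places
        trial = list(best)
        trial[i], trial[j] = trial[j], trial[i]
        if summed_cps(trial) < summed_cps(best):
            best = trial
    return best

good = "UUG AGG CUU UCG CGU".split()
bad = good.copy()
bad[0], bad[2] = bad[2], bad[0]             # swap the two leucine codons
found = greedy_deoptimize(good)

print(round(summed_cps(good), 2), round(summed_cps(bad), 2))
```

Because codons only ever trade places with synonyms, both the amino acid sequence and the overall codon usage are preserved, while the summed pair score changes from 0.33 to −1.79.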

By using this scrambling technique and, 
in effect maximizing nucleotide changes 
in the scrambles, large segments of the 
poliovirus genome were recoded, keeping 
the amino acid sequence, codon usage and 
CPB unchanged, in the hope of disrupt- 
ing essential RNA secondary structures. 
As a test run, the P2 coding region of 
the poliovirus open reading frame (ORF) 
was scrambled, knowing that it contained 
the hairpin cre (cis replication element), 
a structure which is absolutely essential 
for viral replication. Indeed, the P2 scram- 
ble yielded a nonviable virus, yet when 
the cre was rebuilt within the scrambled 
P2 sequence the virus replicated with wt 
kinetics [36]. Following this strategy, two 
unique, functionally redundant RNA el- 
ements were found in the P3 region of 
poliovirus. The ablation of both secondary 
structures yielded an attenuated virus that 
grew with titers which were about 2.5 
log worse than wt and also produced 
tiny plaques, but the presence of either 
RNA element was sufficient for optimal 


Leu Arg Leu Ser Arg
UUG AGG CUU UCG CGU   pair scores: 0.10, 0.17, 0.40, −0.34; Σ = 0.33 (good codon pair score)
CUU AGG UUG UCG CGU   pair scores: −0.82, −0.15, −0.48, −0.34; Σ = −1.79 (bad codon pair score)


the new sequence incurred a much poorer 
codon pair score as a result of shuffling. It is 
also possible to produce sequences that are 
neutral (little change) and increased (more 
over-represented pairs) in CPS as a result. 


growth of the virus [36]. Thus, another 
method to attenuate viruses for use in 
LAVs would be through genome-scale 
changes that perturb the viral RNA sec- 
ondary structures. This has been demon- 
strated with positive-strand RNA viruses 
(e.g., poliovirus), and could also be used 
for other viruses [37]. It is likely that this 
scrambling method would also allow es- 
sential secondary structures to be found 
in negative-stranded RNA viruses, such 
as influenza [38]. In fact, silent mutations 
are being introduced into the influenza 
genome which are thought to disrupt es- 
sential secondary structures in a novel 
manner, in order to design a LAV for 
influenza [39]. It should be reiterated that 
many RNA secondary structures are essen- 
tial in the viral life cycle, and the disruption 
of such structures essentially “kills” the 
virus. More often than not, however, essen- 
tial higher-order structures map outside 
the ORFs [37]. These higher-order struc- 
tures of importance must be identified and 
“rebuilt” within an ORF, and this can be 
an added benefit in research using recoded 
viruses [36]. It follows that this approach 
to live attenuated virus design can be more 
challenging, as RNA secondary structures 
must first be found that will attenuate the 
virus upon perturbation, without “killing” 
the virus. 
3.3
Influenza Virus 


Effective vaccines for polio have existed 
since the mid-twentieth century, with both 
an inactivated polio vaccine (IPV) devel- 
oped by Jonas Salk made available in 1955, 
and an oral LAV (oral polio vaccine; OPV) 
developed by Albert Sabin made available 
in 1963 [40]. There is, therefore, a relatively 
small medical need for a more effective 
polio vaccine (although a more perfect vac- 
cine would be useful for the endgame of 
poliovirus eradication; see below). With 
poliovirus serving as a proof of concept for 
the SAVE vaccine development platform, 
attention turned to developing effective 
LAVs for other viral diseases that posed 
large global health burdens. Currently, 
much attention is being devoted to sev- 
eral plus-stranded and negative-stranded 
RNA viruses [41, 42] using the SAVE plat- 
form, with the most progress reported for 
the negative-stranded influenza virus. 

In the United States, there are an es- 
timated 25-50 million cases of influenza 
infections annually, typically as a result 
of seasonal epidemics, with approximately 
225 000 hospitalizations [43]. The World 
Health Organization (WHO) estimates the 
global disease burden from influenza at 
up to one billion cases annually. The most 
notorious of all recorded flu pandemics 
caused by the infamous 1918 “Spanish” 
influenza virus [44] resulted in 50-100 
million deaths worldwide, with a mortality 
rate >2.5% compared to <0.1% for typi- 
cal flu pandemics. Most recently, a deadly 
outbreak of H7N9 avian flu in China was 
reported in which 126 cases resulted in 24 
deaths, or a mortality rate of 19% [45, 46]. 
Current inactivated trivalent vaccines (con- 
sisting of three components: H1N1 and 
H3N2 influenza A and an influenza B) are 
based on technologies developed before 


and during the 1960s [47]. Flu vaccines 
in the U.S. and Europe are still grown in 
embryonated eggs (a seasonal influenza 
vaccine produced in cell culture by No- 
vartis was recently approved by the FDA 
in 2012) and, due to length of time to 
manufacture, vaccine components must 
be determined by regulatory bodies many 
months prior to their actual use. As a re- 
sult of rapid antigenic drift and changes 
in circulating viruses, recommended virus 
strains in the vaccine are regularly inef- 
fective for the viruses encountered during 
seasonal epidemics [48, 49]. The other major
class of flu vaccines consists of the
cold-adapted LAV (FluMist®) that is ad- 
ministered by nasal spray, but this is only 
approved for patients aged between 2 and 
49 years [50], which makes it unavail- 
able for that segment of the population 
most susceptible to severe presentations 
of disease, namely infants and the elderly. 
Although these current vaccines do exist, 
there remains a need for new vaccines that 
are easier and faster to produce, are effec- 
tive in the general and immunologically 
naive (e.g., infants) population, safe, and 
economically feasible. 

Major differences in genome structure 
between poliovirus and influenza virus 
may impact the level of attenuation us- 
ing SAVE. Poliovirus is a plus-stranded 
RNA virus with a single genomic tran- 
script (essentially its genome is mRNA, 
and can be directly translated). The en- 
tire genome of poliovirus, encoding 10 
proteins, is translated continuously, and 
this results in the formation of a polypro- 
tein that requires autocatalytic proteolytic 
cleavage to form viral proteins [51]. In contrast,
the genomes of the most prevalent
influenza viruses are segmented into eight
(sometimes seven) negative-stranded RNA
genes [52], as shown in Fig. 5. With these
considerations, investigations were made 


Fig. 5 Influenza virion. Influenza A virus contains
eight negative-sense, segmented RNAs,
which bind to the viral polymerase complex
and nucleoprotein (NP). HA is the antigenic

as to whether the SAVE platform would 
produce an effective influenza LAV. 
Working with an influenza A virus 
(strain A/Puerto Rico/8/1934), large parts 
of the polymerase subunit B1 (PB1),
nucleoprotein (NP), and the antigenic
hemagglutinin (HA) genes were recoded,
and this resulted in 236, 314,
and 353 nucleotide changes, respectively,
as shown in Table 1 [53].¹⁾ In addition


1) In a previous communication [54], the
analysis was reported of a virus carrying
the three codon pair-deoptimized genes
PB1, HA, and NP (PR8^3F). Although two
viruses, (PB1+HA+NP)ᴹⁱⁿ and (PB1+HA)ᴹⁱⁿ,
were codon pair-deoptimized and generated,
only data analyzing (PB1+HA)ᴹⁱⁿ [54]
were presented, under the assumption (due to
a mix-up of stocks) that the data had been
obtained with (PB1+HA+NP)ᴹⁱⁿ.
Thus, all data reported in the communication
by Mueller et al. [54] refer to PR8
with deoptimized genes PB1 and HA
[(PB1+HA)ᴹⁱⁿ]. Interestingly, the deoptimized
virus (PB1+HA+NP)ᴹⁱⁿ (PR8^3F) was found
to be very sick and grew very poorly even on
MDCK cells (unpublished results).
glycoprotein, hemagglutinin, which binds to
sialic acid expressed on cellular surfaces. NA
is neuraminidase, a glycosidase that promotes
viral release from infected cells.


to calculated CPB, the changes in CpG 
and UpA frequencies are also shown (see 
Sect. 4 below). The deoptimized segments 
were synthesized de novo and cloned into 
an eight-plasmid reverse genetics system 
[54]. Each deoptimized segment (PB1ᴹⁱⁿ,
NPᴹⁱⁿ, and HAᴹⁱⁿ) was transfected into
susceptible cells with the complementing
seven wt segments. Each virus was found
to be viable, as shown by the plaque phenotype
in Fig. 6. All of the mutant viruses
formed plaques that were either indistinguishable
from wt, or just slightly smaller.
It was found that, at 50 h post-infection of
Madin-Darby canine kidney (MDCK) cells
with virus, most mutant viruses grew to
an approximately 10-fold lower titer than
wt, while PB1ᴹⁱⁿ grew to about 100-fold
lower titer (see table in Fig. 6). It was
also found (unlike poliovirus) that the 
specific infectivity was not affected, though 
the levels of deoptimized proteins were 
reduced compared to other proteins from 
Tab. 1 Recoded PR8 influenza A virus (strain A/Puerto Rico/8/1934).

Virus        Δ Nucleotides  CPBᵃ of       CPBᵃ   LD₅₀ᵇ       PD₅₀ᶜ      Frequency  Frequency
                            deoptimized          (PFU)       (PFU)      UpAsᵉ      CpGsᵉ
                            segment

wt           N/A            N/A           N/A    3.2 × 10¹   1.0 × 10°  -          -
HAᴹⁱⁿ        353/1775       −0.420        0.019  1.7 × 10³   n.d.ᵈ      137 (92)   95 (30)
NAᴹⁱⁿ        265/1413       −0.396        0.002  2.4 × 10°   n.d.       137 (82)   81 (30)
(NA+HA)ᴹⁱⁿ   618/3188       −0.409ᶠ       0.011  >3.2 × 10°  2.4 × 10°  274 (174)  176 (60)
PB1ᴹⁱⁿ       236/2341       −0.386        0.007  3.2 × 10⁴   n.d.       139 (98)   81 (41)
NPᴹⁱⁿ        314/1565       −0.421        0.012  5.0 × 10?   n.d.       101 (52)   101 (43)

ᵃCPB: codon pair bias.
ᵇLD₅₀: the median lethal dose, killing 50% of infected mice.
ᶜPD₅₀: the median protective dose, protecting 50% of vaccinated mice from challenge with 10⁵ PFU of
wt PR8 (3000 × LD₅₀) on day 28 post-infection.
ᵈn.d.: not determined.
ᵉWild-type frequencies are in parentheses.
ᶠWeighted average codon pair bias.


[Fig. 6 plaque images: PR8, HAᴹⁱⁿ, PB1ᴹⁱⁿ, NPᴹⁱⁿ]

Max titers at 50 h p.i.: PR8 (wt) 6.0 × 10⁸; HAᴹⁱⁿ 1.5 × 10⁸

Fig. 6 Plaque phenotype of codon 
pair-deoptimized PR8 influenza A 
viruses. The plaque assays were 
performed on MDCK cells. Maximum 
titers from growth kinetics studies 
are shown in the table at 50h post 
infection. 


the same virus, or the same protein in 
wt-infected cells. 
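
Table 1 lists the CPB of the deoptimized segments of the (NA+HA) double mutant as a weighted average. A short check in Python, assuming that the weights are the segment lengths in nucleotides (the table does not state the weighting explicitly), reproduces the tabulated value from the single-segment entries:

```python
# (segment length in nucleotides, CPB of the deoptimized segment), from Table 1.
SEGMENTS = {"HA": (1775, -0.420), "NA": (1413, -0.396)}

def weighted_cpb(segments):
    """Length-weighted mean CPB across several recoded segments."""
    total_length = sum(length for length, _ in segments.values())
    weighted_sum = sum(length * cpb for length, cpb in segments.values())
    return weighted_sum / total_length

combined = weighted_cpb(SEGMENTS)
print(round(combined, 3))
```

The result rounds to −0.409, in agreement with the table entry for the double mutant.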

The most antigenic components of the 
influenza virus are the glycoproteins HA 
and neuraminidase (NA), which are not 
directly involved in viral genome repli- 
cation. In fact, these antigenic polypep- 
tides by themselves are necessary and 


sufficient to induce protective immunity 
against infection (although cellular im- 
munity clearly plays a role in clearing 
infections). Despite the central role of
these antigens, two influenza viruses
were deliberately synthesized with recoded
and deoptimized glycoprotein genes, the
NA^Min and (NA+HA)^Min variants [55]. As
was found with the HA^Min virus [53],
both the NA^Min and (NA+HA)^Min variants
were viable. Surprisingly, these codon
pair-deoptimized variants reached titers in
MDCK cells that were only slightly lower
than that of wt PR8 (see Fig. 7). Interestingly,
the attenuation exhibited by the three
deoptimized viruses (NA^Min, HA^Min, and
(NA+HA)^Min) was not additive. Unexpectedly,
the (NA+HA)^Min variant replicated to a
maximal titer of 2.1 × 10^8 plaque-forming
units (PFU) ml^-1, slightly higher than either
the NA^Min or HA^Min variant. Plaque assays
further revealed that the three variants
Fig. 7 Growth phenotype of PR8 influenza
viruses with deoptimized glycoproteins HA
and/or NA. The upper left panel shows plaque
phenotypes in MDCK cells infected with
HA^Min, NA^Min, and (NA+HA)^Min viruses. The


shared similar plaque size, though slightly 
smaller than those of wt PR8 (Fig. 7). 

The unexpectedly robust growth of the
(NA+HA)^Min variant was found to depend
on the tissue culture cell type used for
propagation. Whereas it was only slightly
attenuated in MDCK cells, the (NA+HA)^Min
variant barely grew in adenocarcinomic
human alveolar basal epithelial (A549)
cells, reaching titers that were 3-4 logs
lower than those of the progenitor wt PR8
in growth kinetics comparisons (Fig. 7).
The reason for this remarkable attenuation
of the (NA+HA)^Min variant might be the
intact interferon signaling pathway in
A549 cells [56].

As expected, the expression of both 
glycoproteins NA and HA was found to 
be reduced in MDCK cells infected with 
the (NA+HA)^Min variant compared to the
wt PR8. Originally, it was hypothesized 
upper right panel shows growth kinetics of
deoptimized viruses in MDCK cells through a
48 h infection cycle. The lower panel shows
growth kinetics of (NA+HA)^Min virus relative
to wt PR8 in interferon-competent A549 cells.


that defects in viral translation were the 
main contributor to attenuation as a result 
of codon pair deoptimization. However, 
while this may be true for the reduction 
of the HA protein, the decrease in NA 
is most likely related to mRNA stability. 
The precise mechanisms leading to the 
reduction of NA and HA expression are 
still under investigation. 

The (NA+HA)^Min variant was found
to be highly attenuated in BALB/c mice.
The LD50 in mice was determined to be
>3.2 × 10^6 PFU (Tab. 1), making it the most
attenuated virus synthesized in the authors’
laboratory to date. It was assumed that the
deoptimization of NA (NA^Min), rather than
that of HA (HA^Min), contributed most to the
attenuation, since the LD50 of the NA^Min
virus was 2.4 × 10^5 PFU compared to
1.7 × 10^3 PFU for HA^Min. In mice, unlike in
cell culture, it seems that a greater degree of
attenuation is observed with more
deoptimization. The PD50 (protective dose 50;
the viral dose that provides protective
immunity to half of the mice in the test
group) of the (NA+HA)^Min variant was only
2.4 PFU in BALB/c mice, the lowest of all
influenza vaccine candidates to the authors’
knowledge, giving an LD50/PD50 ratio
>1.3 × 10^6.
The LD50/PD50 ratio is seen as the vaccine
“safety margin” of a given virus; the larger
the ratio, the wider the range of doses that
can be explored for safety and efficacy in a
vaccine. In addition, the protection induced
by the (NA+HA)^Min virus was long-lasting,
being effective for at least seven months.
The (NA+HA)^Min virus was also found to
be highly protective against heterologous
challenge with two different mouse-adapted
H3N2 viruses, influenza A/Aichi/2/1968
and A/Victoria/3/75, with PD50 values of
237 PFU and 147 PFU, respectively. Based
on these results, it is believed that the
recoded (NA+HA)^Min virus is a promising
vaccine candidate, and it is suggested
that the SAVE platform could be effective
in producing a vaccine for influenza.
Of course, many further investigations
will need to be conducted to support
the claim that codon pair-deoptimized
influenza viruses, such as the (NA+HA)^Min
variant, are candidates for the development
of human influenza virus vaccines.
For example, the pathogenesis and protection
profiles of the influenza virus strains
need to be tested in ferrets [57], which
are considered an important animal model
for the evaluation of potential flu vaccine
candidates.
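
The safety margin quoted above is simple arithmetic on the two doses. The sketch below reproduces it for the homologous challenge, taking the LD50 as the lower bound of 3.2 × 10^6 PFU listed in Tab. 1 and the PD50 as 2.4 PFU; the function name is illustrative, and because the LD50 is only a lower bound, the result is a lower bound as well.

```python
def safety_margin(ld50_pfu, pd50_pfu):
    """Return the LD50/PD50 ratio for a vaccine candidate."""
    if pd50_pfu <= 0:
        raise ValueError("PD50 must be positive")
    return ld50_pfu / pd50_pfu


LD50_LOWER_BOUND = 3.2e6  # PFU; reported only as a lower bound (>3.2 x 10^6)
PD50_HOMOLOGOUS = 2.4     # PFU; protection against challenge with wt PR8

margin = safety_margin(LD50_LOWER_BOUND, PD50_HOMOLOGOUS)
print(f"LD50/PD50 > {margin:.1e}")  # LD50/PD50 > 1.3e+06
```

The same function can be applied to the heterologous PD50 values (237 PFU and 147 PFU), which give correspondingly smaller, though still large, margins.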

It was recently noted [55] that once 
the genome sequence of a newly evolved 
virulent influenza virus strain circulating 
in seasonal epidemics or pandemics has 


been identified, the computer-aided de- 
sign of codon pair-deoptimized NA and 
HA sequences can be achieved very quickly 
(in minutes), while the construction of 
the corresponding attenuated vaccine can- 
didate (by chemical synthesis of cDNAs 
and reverse genetics) can be achieved in 
a very short time (days). Dormitzer et al. 
[58] meanwhile, have presented an alterna- 
tive approach which is also dependent on 
the chemical synthesis of influenza virus 
genes, and leads very rapidly to influenza 
vaccine candidates. The nature of the vac- 
cine candidates described here [53] and by 
Dormitzer et al. [58] is very different. In the 
present authors’ candidates, attenuation 
rests predominantly on the deoptimiza- 
tion of the glycoprotein genes and possibly 
one other replication protein (though the 
amino acid sequences of the viral pro- 
teins, and hence their immunogenicity, 
remain absolutely unchanged). However, 
in the approach described by Dormitzer
et al., the genes for HA and NA remain
unchanged, and attenuation is achieved by
designing, each time, altered amino acid
sequences of replication proteins. Which of
these strategies presents a faster approach 
to better vaccines remains to be seen. 
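
To illustrate why the computational step is fast, the following minimal sketch scores a coding sequence by its codon pair bias, taken here, as CPB is commonly defined, to be the mean of the codon pair scores of all adjacent codon pairs. It is not the authors’ design software: the score table is a hypothetical toy (real scores are derived from observed versus expected codon pair frequencies in a reference set of genes), and the search over synonymous rearrangements is omitted.

```python
def split_codons(cds):
    """Split an in-frame coding sequence into a list of codons."""
    if len(cds) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def codon_pair_bias(cds, pair_scores):
    """Mean codon pair score over all adjacent codon pairs in cds."""
    codons = split_codons(cds)
    pairs = list(zip(codons, codons[1:]))
    return sum(pair_scores[p] for p in pairs) / len(pairs)


# Hypothetical scores for illustration only; these are not real data.
TOY_SCORES = {
    ("GCA", "GGA"): 0.3, ("GGA", "GCC"): 0.1, ("GCC", "AAA"): 0.2,
    ("GCC", "GGA"): -0.4, ("GGA", "GCA"): -0.2, ("GCA", "AAA"): -0.3,
}

original = "GCAGGAGCCAAA"  # GCA GGA GCC AAA: Ala-Gly-Ala-Lys
recoded = "GCCGGAGCAAAA"   # GCC GGA GCA AAA: the two Ala codons exchanged

# Same codons, hence same protein and same codon usage; only pairs differ.
assert sorted(split_codons(original)) == sorted(split_codons(recoded))

print(round(codon_pair_bias(original, TOY_SCORES), 3))  # 0.2
print(round(codon_pair_bias(recoded, TOY_SCORES), 3))   # -0.3
```

Exchanging synonymous codons leaves the amino acid sequence and the codon usage untouched while lowering the score, which is the property that codon pair deoptimization exploits.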


3.4 
Recoding HIV 


Recently, Martrus et al. recoded the HIV-1 
gag and pol genes by either codon pair op- 
timization or codon deoptimization (intro- 
ducing rare codons) to investigate whether 
replication capacity would also be altered 
for a retrovirus [59]. Retroviruses dif- 
fer from both poliovirus and influenza 
virus in that, after reverse transcription of
the viral genomic RNA into cDNA, the 
double-stranded viral DNA is permanently 
integrated into the host genome. These 
authors reported significant changes in 
replication phenotypes in tissue cultures
of codon pair-deoptimized HIV variants, 
an observation which suggested that the 
SAVE strategy could also be used to tailor 
the gene expression of lentiviruses. 


3.5 
Recoding Prokaryotes 


Whilst many of the authors’ recent studies
have focused on applying SAVE to
human pathogenic viruses, Coleman and
colleagues have recently shown this strategy
to be effective also in attenuating the
virulence of the human pathogen
Streptococcus pneumoniae [60]. This bacterium
remains the leading global cause of pneu- 
monia in adults and children, despite 
the availability of vaccines [61]. Coleman 
et al. [60] focused on the bacterial gene 
ply encoding the pore-forming toxin
pneumolysin (PLY). The ply gene was deoptimized and
used to replace the wt gene in the bac- 
terium; subsequent analyses showed that 
the expression of ply and production of 
PLY were decreased in the recoded S. 
pneumoniae strain. More importantly, the 
strain harboring the deoptimized ply was 
not only less virulent in mice compared to 
the wt progenitor, but was also more atten- 
uated than a bacterial strain from which ply 
was deleted altogether [60]. These studies 
provided an important proof of principle 
that the strategy of SAVE is likely to be
effective for the construction of bacterial
vaccines.


4 
Mechanisms of Attenuation by Recoding 


Current knowledge regarding the mech- 
anism of attenuation by altering codon 
usage and/or CPB is poor, and sev- 
eral possibilities are being investigated. 


Codon usage bias has been known for some
time, and several excellent reviews are
available discussing its possible causes
and consequences [26, 27, 62]. As has
been shown for poliovirus and influenza
virus, levels of both codon- and codon
pair-deoptimized viral proteins have been
found to be lower than those encoded by
the wt genome. However, it is not
completely clear to what extent the reduced
protein levels can be accounted for by re- 
duced mRNA levels, or by an inhibition of 
translation. 

Structural aspects of tRNAs and 
their association with ribosomes in 
the aminoacyl-tRNA (A) site and the 
peptidyl-tRNA (P) site [63] may explain 
CPB. It has been proposed that structural 
features of tRNAs may regulate their 
geometry within the ribosome. As a 
result, unfavorable physical interactions
between the A-site and P-site tRNAs
may disfavor certain combinations
of tRNAs, with the predominant effect
contributed by the A-site tRNA [64]. In
addition, structural features of the mRNA 
itself arising from interactions between 
codons within codon pairs may also 
affect translation [65]. Another theory 
propounds the parsimonious use of 
tRNAs. In effect, once a certain codon 
has been used, future occurrences of that 
particular amino acid are not encoded 
by codons randomly; rather, codons 
using the same tRNA are preferred [66]. 
This “tRNA recycling model” [67] may
arise because tRNA diffusion away from 
the ribosome is slower than the rate of 
translation. 

As a result of tRNA structural features 
and abundance, it has been proposed that 
translational step times can be controlled 
[68]. The codon-specific translation 
rates in vivo in E. coli were found to 
be between 5 and 21 codons per
second [69], but if hairpin structures were
present in the mRNA then this rate of 
translation decreased to 0.080 codons per 
second [70]. Interestingly, the process of 
mRNA translation has been found to be 
discontinuous and to include pauses that 
are thought to occur when rare codons are 
encountered during translation; however, 
it has been proposed that they are 
necessary in order to produce functional 
enzymes. For example, when rare codons 
were replaced by preferred codons in 
the yeast gene, TRP3, this unexpectedly 
caused a reduction in relative enzyme 
activity [71]. It has been postulated that, 
whilst protein folding occurs cotrans- 
lationally, difficult protein folds may 
require more time to occur properly [72, 
73], and therefore an incorrect maturation
of viral proteins could arise from
codon or codon pair deoptimization of viral
genomes. This may have a greater effect
on the large poliovirus polyprotein than 
on the much smaller proteins encoded by 
the segmented genes of influenza virus. 
Yet, it must be noted that the impact 
of tRNA abundance on translation rate 
remains controversial [74], and many of 
these theories do not apply broadly, or 
have not been confirmed experimentally. 
The discussion so far has focused on the
possible consequences of altering CPB for
correct viral translation and cotranslational
folding. Another cause of attenuation could
be post-transcriptional, in that mRNA
stability could be affected. For example, the
rates of mRNA decay can differ by more 
than 100-fold between specific transcripts, 
and it is unknown whether SAVE pro- 
duces attenuation by increasing decay. At 
least in the case of codon usage, no sim- 
ple correlation exists between the mRNA 
half-life and codon use [75]. Addition- 
ally, context-specific sequences could be 
introduced during genome recoding that 


could also destabilize the mRNA, such
as GU-rich elements [76] and AU-rich
elements [77], and stability may even be
influenced by the genome GC content [78]. These nucleotide
sequences serve as recognition sites for a 
vast array of proteins involved in RNA 
metabolism, thought to comprise 3-11% 
of the total expressed protein [79], which 
could target mRNA for degradation. In ad- 
dition to sequence-specific elements, RNA 
structural elements also play an important 
role in stability [80]. It should be noted 
that the present authors’ computational 
approach to genome recoding attempts to 
eliminate the introduction of RNA sec- 
ondary structure, though the algorithm’s 
effectiveness has not been investigated. 
Nonetheless, much has been done to 
computationally predict RNA structures 
[81]. 

It was shown in 2009 by Burns et al. that 
changes to codon usage by replacement of 
the poliovirus capsid coding region with 
unpreferred synonymous codons would 
result in a sharp decrease in the replicative 
fitness of type 2 MEF-1 poliovirus strain 
[82], as would be expected. However, the 
main mechanism of attenuation was at- 
tributed to an increase in CpG and UpA 
dinucleotide pair frequencies, which have 
been found to correlate with a decline in 
viral fitness and are normally suppressed 
in viral genomes. It should be noted that
deoptimization of either CPB or codon
usage results in increased frequencies
of CpG and UpA dinucleotides in the
recoded segments. However, in codon
pair deoptimization, the increase in CpG
or UpA dinucleotides results solely from
the new pairing of existing codons; that
is, they are all CpG3-1 and UpA3-1 (the
numbers indicate the positions of the
nucleotides, xxC-p-Gxx, in the codon pair,
where C is in the third position of the first
codon and G is in the first position of the
second codon). Rare codons, on the other
hand, bring along an abundance of new
“internal” CpG or UpA dinucleotides.
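
The distinction between junction (3-1) and internal dinucleotides can be made concrete with a small counting routine. The sketch below is illustrative only: the function name and the two short test sequences are hypothetical, sequences are written as cDNA (so UpA would be searched as "TA"), and the input is assumed to start in frame.

```python
def dinucleotide_by_position(cds, dinucleotide):
    """Count a dinucleotide in an in-frame coding sequence by codon position.

    "1-2" and "2-3" are internal to a codon; "3-1" spans the junction
    between the third base of one codon and the first base of the next.
    """
    labels = ("1-2", "2-3", "3-1")
    counts = {label: 0 for label in labels}
    for i in range(len(cds) - 1):
        if cds[i:i + 2] == dinucleotide:
            counts[labels[i % 3]] += 1
    return counts


reordered = "GCCGGAGCAAAA"   # GCC GGA GCA AAA: CpG only at a codon junction
with_gcg = "GCGGGAGCAAAA"    # GCG GGA GCA AAA: CpG inside the codon GCG

print(dinucleotide_by_position(reordered, "CG"))
# {'1-2': 0, '2-3': 0, '3-1': 1}
print(dinucleotide_by_position(with_gcg, "CG"))
# {'1-2': 0, '2-3': 1, '3-1': 0}
```

Both sequences encode Ala-Gly-Ala-Lys. The first acquires its CpG purely from the juxtaposition of two codons, as in codon pair deoptimization, whereas the second carries it inside a CpG-containing codon, as happens when such codons are introduced by codon deoptimization.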

So, why are CpG and UpA disfavored? 
Currently, there is no satisfactory answer 
to this question, but CpG and UpA have
been known to be under-represented in
viral genomes since the first genome of a
eukaryotic virus was sequenced [83], a
finding later confirmed by an exhaustive
analysis of eukaryotic RNA genome
sequences [84]. It has been posited that an
alteration in innate immune responses to
deoptimized viruses is the primary
biological mechanism of attenuation. Both
Greenbaum et al. [85] and Jimenez-Baranda
et al. [86] showed that motifs with high
CpG content in an AU-rich background
increased α-interferon expression in
dendritic cells.
Changes in innate immune response to 
the deoptimized virus could also account 
for alterations in RNA or protein con- 
centrations, as described above. However, 
experiments with polioviruses with delib- 
erately increased contents of CpG have 
not shown any effect on innate immune 
response in tissue culture (M. Arabov 
and E. Wimmer, unpublished results). 
Moreover, Rubella virus (the cause of 
German measles) has been reported to 
use the expected number of CpG
dinucleotides [87]. Apparently, this common
human RNA virus can “live” well with all
the CpG dinucleotides expected from its
sequence.

Admittedly, there is much to be done 
in this area, and preliminary data are 
available which suggest that changes in 
mRNA stability and translation are both 
involved. Interestingly, the introduction 
of under-represented codons and codon 
pairs into transcripts could also be used 
to investigate the inner workings of the 
translational machinery, as well as RNA 
metabolism. 


5 
Eradication as the Goal of Vaccines 


In addition to clean drinking water and 
universal sanitation, vaccines have played 
a vital role in the marked increase of life 
expectancy of humankind over the past 
century [88]. As advances in vaccine de- 
sign (including the SAVE strategy) and 
production occur, a scientific and policy 
shift from the control of infectious dis- 
eases on a local scale to one of eradication 
on a global scale would be a high-risk 
goal, but one with vast potential economic 
and health benefits. For example, the eco- 
nomic benefit to the world from smallpox 
eradication has been calculated as being
up to US$1.35 billion annually [89].
Moreover, when vaccine campaigns are 
no longer conducted and vaccinations are 
discontinued, it is estimated that savings 
of US$1.5 billion would accrue annually 
upon the eradication of polio [89]. If the 
benefits outweigh the incremental costs, 
then the ultimate achievement of a vac- 
cination program would be the global 
eradication of an infectious agent [90-92]. 
Unfortunately, however, this task is beset 
not only with potentially insurmountable 
scientific obstacles but also with politi- 
cal and economic problems. To date, only 
two infectious agents — the smallpox virus 
and the rinderpest virus — have been cer- 
tified as eradicated by the World Health 
Assembly, and only one of these affects 
humans. 

Initially conceived in 1958 as a four- 
to five-year program, the global eradica- 
tion of smallpox extended for over 20 
years and included mass and ring vac- 
cination campaigns before the last natu- 
rally occurring case was documented in 
1977 [1]. Although smallpox was declared 
eradicated in 1980 [93], it would be an- 
other 31 years before the rinderpest virus, 
the causative agent of cattle plague, was
declared eradicated in 2011 [94]. Since 
1988, inspired by the elimination of wt 
poliovirus-mediated poliomyelitis in the 
Americas, a third eradication effort against 
poliovirus has resulted in a >99% de- 
cline of worldwide poliomyelitis cases [95]. 
Complete eradication has been hindered 
by several problems, however, including 
endemic reservoirs, the re-establishment 
of transmission by importation into pre- 
viously polio-free regions, and outbreaks 
caused by a vaccine-derived poliovirus [6], 
to name but a few (see below). Initially, 
the target for the eradication of poliovirus 
was the year 2000, and although no new 
target date has been set it is still ex- 
pected that poliovirus will be the third 
infectious agent to be eradicated. Other 
infectious agents that could be, or have 
been, targeted for eradication include the 
measles virus [96], mumps virus, rubella 
virus [97], the parasites Dracunculus medi- 
nensis [98], Onchocerca volvulus [99], and 
even hepatitis C virus [100], for which no 
effective vaccine currently exists. Although 
measles-mediated disease has been elim- 
inated in many industrialized nations, 
approximately 750 000 children are killed
annually by the virus in developing coun- 
tries [89]. 

The characteristics of infectious
agents that make them good candidates
for eradication include: (i) the virus is
“human” only and has no animal reser- 
voir; (ii) one or more excellent vaccines 
exist; (iii) an effective global surveillance 
program has been established [101]; and 
(iv) accurate and quick diagnostic meth- 
ods have been developed. In the case of 
smallpox, the patients developed symp- 
toms that were easily recognized within 
a relatively short incubation time; more- 
over, only humans can easily transmit 
this species of poxviruses and develop the 


terrible disease (note: specific poxviruses 
exist for numerous animals, and the virus 
specific for cattle served as the crucial 
vaccine). 

In order for the eradication of an infec- 
tious agent to be feasible, it is desirable 
that the vaccine confers long-lasting, if not
life-long, immunity. The smallpox vaccine 
was highly effective (though prone to seri- 
ous side effects) and methods for rapid, 
large-scale immunizations, such as the 
bifurcated needle [102], were developed. 
However, it has been noted that the
common smallpox vaccine may not
confer life-long protection. An alternative
strategy for the eradication of an
infectious agent has been shown for the 
guinea worm disease caused by D. medi- 
nensis, for which no vaccine currently ex- 
ists. Rather, interruption of the parasite’s 
life cycle by preventing its transmission via 
a microscopic copepod vector is the main 
tool for eradication [103]. 

The obstacles encountered during the 
Global Polio Eradication Initiative (GPEI) 
attest to the difficulties of eradication. To 
be sure, eradication has been highly suc- 
cessful up to the current “end game.” 
As of March 2013, only three countries
(Afghanistan, Nigeria, and Pakistan)
had reported endemic cases of
poliomyelitis caused by the wt virus [104].
Unfortunately, outbreaks in previously 
polio-free regions have hindered efforts, 
with outbreaks of poliomyelitis occurring 
in both Kenya and Somalia in June 2013 
[105]. Factors that have slowed the fi- 
nal progress include transmission of the 
virus by infected individuals that exhibit 
no symptoms of disease [106], weak vac- 
cine coverage and occasional weaknesses 
in disease surveillance leading to out- 
breaks, the killing of vaccination work- 
ers [107, 108], poor immunogenicity of 
the oral vaccine in certain populations 
[109] and, as mentioned previously,
poliomyelitis caused by the vaccine (one case
in 750 000 vaccine doses) [110]. Notably,
WHO issued a global alert in August 2013 
[111] warning of a moderate to high risk of 
wild poliovirus type 1 (WPV1) spreading 
following detection of the virus in sewage 
samples from 24 sampling sites in central 
and southern Israel, where poliovirus was 
eliminated in 1994 [112]. Furthermore, 
WPV1 was isolated in the stool samples 
of 27 healthy children (aged <9 years) and 
one adult, all of whom had been fully im- 
munized. Fortunately, no case of paralytic 
polio has been reported, for which reason 
the circulation of WPV1 in Israel in the summer
and fall of 2013 has been described as
a “silent epidemic.” It is too early to state 
with certainty why WPV1 is circulating in 
Israel, but one plausible hypothesis is that 
immunity in the gastrointestinal tract of 
the population is weak (not the least be- 
cause only IPV has been used in Israel 
since 2000). 

Many of these issues are currently be- 
ing addressed by the WHO. For example, 
the development of effective IPVs, which 
are incapable of causing disease, is under 
way but a major concern is affordability 
in lower-income countries. Nevertheless, 
the GPEI is not without success, hav- 
ing eradicated type 2 poliovirus in 1999, 
leaving only two serotypes still in circu- 
lation [106]. Currently, the SAVE strategy 
is being applied to produce and evalu- 
ate alternative vaccine strains that solve 
issues observed with currently used oral 
polio vaccine (OPV), including recom- 
bination with co-circulating coxsackie A 
viruses that can lead to the generation of 
pathogenic recombinants [17, 113]. With 
continued funding and public support, 
the eradication of polio is expected to be 
realized, although the world population 
may have to be content with an adequate 


control of the disease rather than with 
the total eradication of the wt poliovirus 
strains. 


6 
Concluding Remarks 


The field of synthetic biology has the 
potential to revolutionize vaccine devel- 
opment. Since the synthesis of portions 
of a hepatitis C virus replicon in 2000 
[114] and, in the present authors’ laboratory,
the complete poliovirus [24] in 2002,
there has been great interest in synthesiz- 
ing other viruses [25], although it should 
be noted there was immense initial out- 
rage after the synthesis of poliovirus was 
published [115]. Currently, the largest viral 
synthesis project resulting in the produc- 
tion of infectious viruses is that of the 
severe acute respiratory syndrome (SARS) 
coronavirus at 29.7 kb [116], although it
would not be surprising if, as synthesis
costs decrease and technological advances
are made, even larger viral genomes were
synthesized [117, 118]. In fact, one
large biologically active (replicating) bacte- 
rial genome has already been synthesized 
[119]. Beginning with genome reduction in 
bacteria [120, 121], efforts have culminated 
in a synthesis of the 582 970-base genome 
of Mycoplasma genitalium [122]. Such
rapid advances will benefit not only vac- 
cine development but also scientific fields 
as diverse as metabolic engineering [123] 
and computing [124-126]. The initial re- 
sults of approaches employing synthetic 
biology to develop viral vaccines appear 
highly promising, with several other viral 
vaccine candidates currently being pur- 
sued, using the SAVE platform, in efforts 
to demonstrate the broad applicability of 
this technology. 
Keywords


1,2-Propanediol (1,2-PD) 

A diol with applicative potential, especially as a raw material for the chemical industry. 
1,2-PD is currently produced from non-renewable feedstocks, and is known also to be 
synthesized from sugars by various bacteria and fungi. 


1,3-Propanediol (1,3-PD) 

A diol with applicative potential, especially as a raw material for the chemical 
industry. Its microbial production has recently attracted attention in the valorization of 
biodiesel-derived crude glycerin. 1,3-PD is known to be synthesized from glycerol by 
Clostridia sp., Enterobacteriaceae, and lactic acid bacteria. Production from glucose is 
possible in engineered Escherichia coli. 


1,4-Butanediol (1,4-BD)

A diol produced only from fossil feedstocks. It is not known to occur in the metabolism 
of any natural organism. Microbial production is possible in engineered E. coli with 
genes of different origins. 


2,3-Butanediol (2,3-BD)

A diol with applications in various branches of the chemical industry. Its
production is well studied in members of the Enterobacteriaceae and Bacillus
species. 2,3-BD is obtained from sugars in mixed fermentations. The possibility
exists of introducing various carbohydrate-rich industrial byproducts into its
production.


Culture conditions

The physico-chemical parameters under which a culture is performed, such as
temperature, pH, or redox potential. These parameters often determine the rate
of the process or the balance between different products formed. Optimization
of these parameters is an essential step en route to the maximization of process
efficiency.


Metabolic Engineering for the Production of Diols


Medium composition 

The qualitative and quantitative composition of culture broths used for the production 
of various metabolites. The medium composition influences the overall outcome of the 
process, and often has a decisive effect on the cost of the final product. Understanding 
the role of the ingredients of the culture media in the metabolism of cells allows process 
optimization and is useful when designing genetic manipulations. 


Metabolic engineering 

The optimization of various metabolic processes directed at improving the production 
of a metabolite of interest. Metabolic engineering requires an in-depth knowledge of the 
pathways involved in metabolite formation and the regulatory mechanisms involved. 
The tools of genetic engineering may be applied after the so-called “flux analysis” has 
been performed. 


During recent years, the reduced use of chemical processes to produce commodity 
chemicals, and their substitution with microbiological alternatives, has been 
frequently observed. However, because of the low efficiency of these biotechnological
syntheses, investigations are constantly being performed to increase their
productivity by controlling the metabolism of bacterial cells. One solution to
this problem would be to modify the environmental conditions of the culture 
of microorganisms or, alternatively, to control the composition of the culture 
medium by using additional and important enzymes that serve as cofactors in 
metabolite production. Such production by microorganisms can be also improved 
by employing genetic engineering tools, whereby modified bacteria are able to 
synthesize several-fold more desirable metabolites than can wild-type strains. The 
biotechnological synthesis of diols such as 1,2-propanediol (1,2-PD), 1,3-propanediol 
(1,3-PD) and 2,3-butanediol (2,3-BD) by the direct microbial bioconversion of 
renewable feedstocks, and even from waste materials of biofuel production, 
has been well described. These diols are widely used in many branches of 
industry, including the production of chemicals, food products, cosmetics and 
pharmaceuticals. As the methods used to increase the efficiency of their production 
are still being investigated, recent developments in the production of 1,2-PD, 1,3-PD 
and 2,3-BD, with regards to the metabolic engineering of production strains and 
the optimization of fermentation processes, are reviewed and discussed in this 
chapter. 
1
Introduction 


Nowadays, there is a tendency to re- 
duce the use of chemical processes in 
the production of commodity chemicals. 
During the early twentieth century, most 
bulk chemicals originating from microbes 
were produced by fermenting biomass 
such as potatoes and corn. Today, however,
the main aim is to develop improved
technologies for the conversion of
waste biomass (e.g., glycerol and
lignocellulosic materials) into industrially
useful metabolites using microbiological
techniques. Indeed, the conversion
of biomass to high-value specialized
chemicals is one of the most important
aspects of “green” chemistry [1-3].

The main aims of green chemistry are 
to design, develop and implement chem- 
ical products and processes in order to 
reduce or eliminate the use and gener- 
ation of substances that are hazardous 
to human health and the environment. 
Typically, green chemistry is applied to 
pollution prevention and waste minimiza- 
tion [4], with major processes being ap- 
plied to certain solvents (e.g., butanol, 
acetone, isopropanol) and the produc- 
tion of diols, including 1,2-propanediol 
(1,2-PD), 1,3-propanediol (1,3-PD), and 
2,3-butanediol (2,3-BD) [5, 6]. 

Diols are chemicals with two hydroxyl 
groups. Some diols, such as 1,2-PD, 1,3-PD
and 2,3-BD (Fig. 1), can be produced
biotechnologically via the direct microbial 
bioconversion of renewable feedstocks and 
even from waste materials created in 
biofuel production (Fig. 2). Together, these 
diols are considered to be “platform” 
green chemicals [7]. 

Although the microbiological syntheses 
of 1,3-PD and 2,3-BD have been studied 
intensively during the past few years (e.g., 


[8-21]), relatively few reports are available
regarding the biotechnological production
of 1,2-PD (see Refs [22-24]).

These three diols are used in many 
branches of industry: 


• 1,2-PD: The main applications of 1,2-PD
include the preparation of polyester
resins for film and fiber manufacture; it
is also used as a deicer in airplane fluids,
as a nontoxic replacement for ethylene
glycol in automobiles, as an antifreeze
in breweries and dairy establishments,
as an inhibitor of mold growth, and
as a mist to disinfect air [3, 7, 25].
Pure stereoisomers of microbiologically
synthesized 1,2-PD are also used in the
synthesis of specialty chemicals, such
as optically active propylene oxide and
polymers, and also in the manufacture
of some chiral pharmaceutical products
[25].

• 1,3-PD: This is an important chemical
product that can be used for
synthesis reactions, in particular as
a monomer for polycondensations
to produce polyesters, polyethers
and polyurethanes [9]. 1,3-PD is
also well known as a monomer for
the synthesis of polytrimethylene
terephthalate (PTT), a polyester with
excellent properties for materials such
as fibers, textiles, carpets, and coatings.
It is also used in solvents, adhesives,
resins, detergents, and cosmetics [7].

• 2,3-BD: This has three stereoisomeric forms (dextro, levo, and meso), all of which are produced via microbiological processes. As the levo isomer has a low freezing point (-60 °C), it can be used as an antifreeze. Both the levo and dextro isomers are chiral components for asymmetric syntheses and are used in the pharmaceutical, agrochemical, fine chemical, and food industries, or for liquid crystals [7, 26]. 2,3-BD is easily dehydrated to methylethylketone (an organic solvent for resins and lacquers), and to butadiene for the manufacture of polyesters and polyurethanes. It can also be dehydrogenated into acetoin and diacetyl, which are flavoring agents used in dairy products, margarines and cosmetics [7]. Possible applications of 2,3-BD in the production of printing inks, perfumes, moistening and softening agents, explosives, and plasticizers have also been demonstrated [7, 27].

Fig. 2 Formation of 1,2-PD, 1,3-PD, and 2,3-BD from different feedstocks [7].

Recent developments in the production of 1,2-PD, 1,3-PD, and 2,3-BD, with regard to the metabolic engineering of production strains and the optimization of fermentation processes, are reviewed and discussed in the following sections.


2
1,2-Propanediol (1,2-PD)


1,2-PD (propylene glycol) is a nontoxic, high-demand chemical that is used to create polyester resins, and also in food products, antifreezes, liquid detergents, cosmetics, and pharmaceuticals [28]. It is a three-carbon diol with a stereogenic center at the central carbon atom, and exists as two enantiomeric forms, (R)-1,2-PD and (S)-1,2-PD (Fig. 3) [3, 25, 29].

1,2-PD, as a bulk chemical, is currently produced in a synthetic process from propylene oxide, via a nonrenewable petrochemical route [7]. It can also be produced from renewable resources via several routes, including the hydrogenolysis of sugars at high temperature and under pressure in the presence of a metal catalyst [23]. Nowadays, there is increasing interest in the microbial formation of 1,2-PD and in enhancing the efficiency of bacterial fermentations [1, 30-34], with 1,2-PD being produced mainly from glucose by utilizing dihydroxyacetone phosphate (DHAP), an intermediate metabolite of the glycolytic pathway. DHAP is subsequently converted to lactaldehyde via methylglyoxal. Both enantiomeric forms (S and R) of 1,2-PD are formed in this pathway (Fig. 4).

Fig. 3 Chemical structures of 1,2-PD stereoisomers [3]. (a) (R)-1,2-PD; (b) (S)-1,2-PD.

Fig. 4 Metabolic pathways for 1,2-PD production from glucose [23]. (Pathway labels in the figure: glucose; dihydroxyacetone-P; methylglyoxal; S-lactaldehyde; acetol; S-1,2-propanediol; R-lactaldehyde; R-1,2-propanediol.)


2.1 
Microbiological Synthesis of 1,2-PD 


The biosynthesis of 1,2-PD can occur within a microbial system via two main pathways: (i) L-fucose and L-rhamnose are metabolized via parallel pathways, mediated by the sequential action of different enzymes which include a permease, an isomerase, a kinase, and an aldolase [35-40]; or (ii) the utilization of DHAP, a glycolytic intermediate which is converted to methylglyoxal by the enzyme methylglyoxal synthase (MGS) [3, 25]. An alternative route for the production of 1,2-PD was reported by Elferink et al. [41], who investigated the lactic acid bacteria Lactobacillus brevis and Lactobacillus buchneri and found that lactic acid could be degraded to acetic acid with the concomitant production of 1,2-PD (albeit with traces of ethanol under anoxic conditions), without the need for an external electron acceptor [41].

Tab. 1 Microorganisms capable of fermenting sugars to 1,2-PD, and yields of the processes.

Strain                                      Substrate                                Yield (g g⁻¹)  Reference
T. thermosaccharolyticum HG-8               D-Glucose                                0.11           [48]
                                            L-Arabinose                              0.13
                                            D-Xylose                                 0.06
                                            D-Lactose                                0.00
                                            D-Galactose                              0.00
                                            D-Glucose and D-galactose                0.14
                                            D-Glucose, D-xylose, and L-arabinose     0.14
Debaryomyces huderoi                        L-Rhamnose                               0.11           [43]
Debaryomyces kléckeri                                                                0.21
Pichia pseudopolymorpha                                                              0.19
Pichia rhodanensis                                                                   0.05
Pichia robertsii                                                                     0.05
Pichia scolyti                                                                       0.01
Pichia wickerhamii                                                                   0.27
Hansenula saturnus                                                                   0.02
Candida polymorpha                                                                   0.32
Torulopsis famata                                                                    0.07
Cryptococcus laurentii                                                               0.02
Cryptococcus neoformans                                                              0.01
Bacteroides ruminicola subsp. brevis B14    L-Rhamnose                               0.42           [39]
C. sphenoides DSM 614                       L-Rhamnose                               0.34           [45]
                                            D-Galactose                              0.29
T. thermosaccharolyticum ATCC 31960         D-Glucose                                0.20           [47]
L. buchneri                                 Lactic acid                              0.42           [41]
L. parabuchneri                                                                      0.42
Saccharomyces cerevisiae                    D-Glucose                                0.02           [50]
(metabolic engineering)
Escherichia coli AG1                        D-Glucose                                0.03           [29]
(metabolic engineering)

A number of microorganisms are able to ferment sugars to 1,2-PD. The production of this diol by both bacteria and yeasts has been reported for Clostridium thermobutyricum [42], Candida polymorpha, Pichia robertsii [43], Bacteroides ruminicola [39], Salmonella typhimurium, Klebsiella pneumoniae [44], Clostridium sphenoides [45], Thermoanaerobacterium thermosaccharolyticum [46-48], L. brevis, and L. buchneri [41, 49]. Details of selected microorganisms capable of converting sugars to 1,2-PD, together with the respective yields of the processes, are listed in Table 1.

The microbiological synthesis of 1,2-PD has been known since the 1950s. As early as 1954, Enebo [42] described the ability of the bacterium C. thermobutyricum to produce 1,2-PD, while 14 years later, in 1968, Suzuki and Onishi [43] reported the ability of yeast to produce 1,2-PD. The synthesis of 1,2-PD by lactic acid bacteria (L. brevis, L. buchneri) was first reported by Veiga da Cunha and Foster [49]. During the microbial synthesis of 1,2-PD, byproducts such as ethanol, acetate, lactate, formate and succinate are also synthesized [39, 41, 48]. Exemplary, albeit hypothetical, metabolic pathways involved in these processes are shown in Fig. 5.

Although several direct fermentation 
routes for the production of 1,2-PD have 
been documented and many experimental 
trials reported, the production of this 
material remains limited and successful 
schemes continue to be developed. 


2.2 
Factors Influencing 1,2-PD Formation by 
Microorganisms 


Today, biotechnology is providing new 
low-cost and highly efficient fermentation 
processes for the production of chemi- 
cals from renewable biomass resources 
[32]. Methods are constantly being devised 
for increasing the efficiency of micro- 
bial syntheses of chemical compounds 
(including diols), notably by controlling 
the metabolism of the bacterial cells em- 
ployed. One approach to this is to modify 
the environmental conditions of the mi- 
croorganism culture [47] or, alternatively, 
to control the composition of the culture 
medium, for example by the addition of 
enzyme cofactors that are involved in the 
production of metabolites [1, 30, 33, 34]. 


2.2.1 Medium Composition and 
Environmental Conditions 

Carbon, nitrogen, and phosphate sources used in the production medium are among the most important factors influencing the production efficiency of 1,2-PD and of other metabolites. Tran-Din and Gottschalk [45] investigated the influence of the level of phosphate in the bacterial medium on metabolite profiles in Clostridium sphenoides, a saccharolytic bacterium that does not produce butyrate but rather synthesizes ethanol, H2, CO2, and some acetate as products [51]. Tran-Din and Gottschalk [45] also noted that these bacteria formed 1,2-PD in the medium when glucose was used as a carbon source under conditions of phosphate limitation. At phosphate concentrations of 0.4 mM and above, ethanol and acetate were the major nongaseous products. At lower phosphate concentrations, two additional products, 1,2-PD and lactate, were detectable in appreciable concentrations. The main conclusion from these studies was that propanediol appeared in the fermentation broth when the phosphate concentration had fallen below 80 µM. Tran-Din and Gottschalk [45] also investigated the ability of C. sphenoides to form 1,2-PD from carbon sources other than glucose, namely D-fructose, cellobiose, L-rhamnose, and L-galactose, under phosphate-limiting conditions. In the case of L-rhamnose and L-galactose, 1,2-PD emerged as the main fermentation product. The metabolic pathway of glucose in C. sphenoides under phosphate-limiting conditions is shown in Fig. 6.

Fig. 5 Hypothetical metabolic pathways of microbiological sugars fermentation. (a) Lactose and galactose; (b) Rhamnose; (c) Lactic acid [3, 39, 48].

The influence of carbon source on 1,2-PD formation was also investigated by Altaras et al. [48], who examined the ability and efficiency of 1,2-PD formation by T. thermosaccharolyticum from key sugars derived from lignocellulosic biomass and from a food processing byproduct stream. 1,2-PD was seen to be produced from D-glucose, L-arabinose, D-mannose, cellobiose, sucrose, D-xylose (the three major sugars in the cellulosic biomass), and also from L-galactose and lactose. However, no 1,2-PD was detected when maltose, D-galactose, D-fructose, and lactose were used. A high efficiency of 1,2-PD production was observed in the case of glucose (0.28 g 1,2-PD per gram glucose), whereas when L-arabinose was used as a carbon and energy source, 0.13 g 1,2-PD per gram saccharide was obtained, and from xylose 0.06 g 1,2-PD per gram saccharide was formed. Only traces of 1,2-PD were detected when other sugars were investigated.

Fig. 6 The metabolic pathway of glucose in C. sphenoides under phosphate-limiting conditions.
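
Because the yields in this section are quoted per gram of sugar, comparing them across substrates of different molar mass is easier on a molar basis. The following is a minimal sketch of that conversion, not a calculation taken from the cited work. It assumes standard approximate molar masses (1,2-PD 76.09 g/mol, hexoses 180.16 g/mol, pentoses 150.13 g/mol), none of which are stated in the text, and the function name is hypothetical.

```python
# Approximate molar masses in g/mol (standard values, assumed; not from the text).
M_12PD = 76.09
M_HEXOSE = 180.16   # e.g., glucose
M_PENTOSE = 150.13  # e.g., arabinose, xylose


def mass_to_molar_yield(yield_g_per_g, m_substrate, m_product=M_12PD):
    """Convert g product per g substrate into mol product per mol substrate."""
    return yield_g_per_g * m_substrate / m_product


if __name__ == "__main__":
    # Mass yields quoted in the paragraph above for T. thermosaccharolyticum.
    reported = {
        "glucose": (0.28, M_HEXOSE),
        "arabinose": (0.13, M_PENTOSE),
        "xylose": (0.06, M_PENTOSE),
    }
    for sugar, (y, m) in reported.items():
        print(f"{sugar}: {y:.2f} g/g = {mass_to_molar_yield(y, m):.2f} mol/mol")
```

On a molar basis the gap between hexose and pentose substrates shifts somewhat, because a mole of pentose weighs less than a mole of hexose.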

Altaras et al. [48] also tested the possibility of producing 1,2-PD from a whey permeate. Despite the fact that 1,2-PD was not synthesized from pure D-galactose by T. thermosaccharolyticum, production of this diol was possible when whey permeate was used alone as the fermentation medium, or when it was supplemented with a yeast extract. Under these conditions, 1,2-PD was formed at a concentration of 2.4 g l⁻¹. Altaras and coworkers stated that the addition of yeast extract to the hydrolyzed whey permeate medium had increased sugar consumption, and this in turn had resulted in increased product titers, without changing the overall selectivity. Although the main conclusion of Altaras et al. [48] was that T. thermosaccharolyticum has great potential for 1,2-PD production, this microbial process requires further investigation with regard to optimizing the medium components and fermentation methods. Altaras et al. [48] also noted that the formation of 1,2-PD by T. thermosaccharolyticum might be significantly increased with the benefit of genetic engineering tools.

Sanchez-Riera et al. [47] investigated the influence of environmental factors such as temperature, pH, gases, and substrate concentration on 1,2-PD synthesis in T. thermosaccharolyticum. The fact that these bacteria have been found to produce methylglyoxal (a precursor of 1,2-PD) at temperatures higher than the normal growth temperature suggested that temperature might have some effect on 1,2-PD formation in this case. The maximum yield of 1,2-PD (0.20 g g⁻¹ glucose) was obtained at 60 °C (the investigated range was 50-65 °C), and it was also suggested that 60 °C might be the optimal temperature for the growth of T. thermosaccharolyticum.

Because the pH of the culture has a significant influence on the end products of anaerobic metabolism of carbon sources (under such conditions, many bacteria tend to produce neutral products at acid pH, and organic acids at alkaline pH), Sanchez-Riera et al. [47] decided to investigate the impact of pH on 1,2-PD formation by T. thermosaccharolyticum, selecting a pH range of 6.0 to 7.2 for the experiments. The maximum 1,2-PD concentration (5.60 g l⁻¹) was observed at pH 6.0. These authors suggested that, at higher pH values, the metabolism would be diverted toward higher acid production because of a lower proton concentration.

The same authors also examined the influence of gas composition on the efficiency of 1,2-PD synthesis, since this product is one of the more reduced metabolites obtained from glucose fermentations by T. thermosaccharolyticum. In these experiments, the fermentations were run under three different conditions: (i) in a continuous flow of oxygen-free N2 in the head space; (ii) without gas flow; and (iii) in a continuous flow of pure H2 through the head space. However, it emerged that the 1,2-PD yield was affected only slightly by the different conditions.
Currently, very few data are available regarding the microbial production of 1,2-PD by wild-type bacterial strains, with most reports having been made 10-20 years ago, or even earlier. As the low efficiency of microbial production results from the slow growth and low productivities of anaerobic processes [48], it is important to employ genetic engineering to improve bacterial strains and thereby increase the efficiency of 1,2-PD synthesis [7, 23, 48, 52].


2.3 
Genetic Engineering in 1,2-PD Production 


During recent years, there has been an increasing interest in the application of genetically engineered strains in biotechnological processes involved in the formation of 1,2-PD by modified microorganisms such as S. cerevisiae, E. coli, and Corynebacterium glutamicum [22-24, 29, 52].

Several organisms ferment sugars to produce 1,2-PD, and two major synthetic routes have been identified. The first route involves the conversion of deoxy sugars such as L-rhamnose and L-fucose to lactaldehyde (a precursor of 1,2-PD) in E. coli, S. typhimurium, and K. pneumoniae [38, 44, 53]. The second route utilizes DHAP (an intermediate metabolite of the glycolytic pathway), which is converted to lactaldehyde via methylglyoxal in T. thermosaccharolyticum, C. sphenoides, E. coli, and S. cerevisiae [45, 48, 54-56]. The latter pathway has attracted more attention because it can allow for the production of 1,2-PD from inexpensive sugars [23].

When Altaras and Cameron [29] described the metabolic engineering of a 1,2-PD pathway in E. coli, they focused on the development of fermentation routes to 1,2-PD from renewable resources. They reported the production of enantiomerically pure (R)-1,2-PD from glucose in E. coli expressing NADH-linked glycerol dehydrogenase (GDH) genes (E. coli gldA or K. pneumoniae dhaD). Altaras and Cameron also showed that E. coli overexpressing E. coli mgs produced 1,2-PD. The expression of either GDH or MGS resulted in the anaerobic production of approximately 0.25 g 1,2-PD per liter. (R)-1,2-PD production was further improved to 0.7 g 1,2-PD per liter when MGS and GDH (gldA) were coexpressed. In vitro studies indicated that the route to (R)-1,2-PD involved the reduction of methylglyoxal to (R)-lactaldehyde by the recombinant GDH and the reduction of (R)-lactaldehyde to (R)-1,2-PD by a native E. coli activity.

Another microorganism was selected for manipulation by Niimi et al. [23], who analyzed 1,2-PD production in metabolically engineered C. glutamicum. In this case, wild-type C. glutamicum produced 93 µM 1,2-PD after an incubation of 132 h under aerobic conditions. Although no gene encoding MGS (which catalyzes the first step of 1,2-PD synthesis from the glycolytic pathway) was detected in the C. glutamicum genome, several genes annotated as encoding putative aldo-keto reductases (AKRs) were present. AKR functions as a methylglyoxal reductase in the 1,2-PD synthesis pathway. Expression of the E. coli mgs gene in C. glutamicum increased the 1,2-PD yield 100-fold, which suggested that wild-type C. glutamicum carries the genes downstream of MGS in the 1,2-PD synthesis pathway. Furthermore, the simultaneous overexpression of mgs and cgR_2242 (one of the genes annotated as AKRs) enhanced 1,2-PD production to 24 mM.
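
The titers in this section are reported in mixed units (g per liter for E. coli, molar concentrations for C. glutamicum). The following is a minimal sketch for putting them on a common scale and expressing improvements as fold changes. It assumes a molar mass of 76.09 g/mol for 1,2-PD, which is a standard value not given in the text, and the helper names are hypothetical.

```python
M_12PD = 76.09  # g/mol, standard molar mass of 1,2-propanediol (assumed; not from the text)


def mm_to_g_per_l(conc_mm, molar_mass=M_12PD):
    """Convert a concentration in mmol/l into g/l."""
    return conc_mm * molar_mass / 1000.0


def fold_change(after, before):
    """Ratio of an improved titer to a reference titer (same units)."""
    return after / before


if __name__ == "__main__":
    # C. glutamicum: 93 µM (wild type) versus 24 mM (mgs + cgR_2242), as quoted above.
    wild_type_mm, engineered_mm = 0.093, 24.0
    print(f"wild type:  {mm_to_g_per_l(wild_type_mm):.4f} g/l")
    print(f"engineered: {mm_to_g_per_l(engineered_mm):.2f} g/l")
    print(f"fold change: {fold_change(engineered_mm, wild_type_mm):.0f}")
    # E. coli: 0.25 g/l (single gene) versus 0.7 g/l (MGS + GDH coexpressed).
    print(f"E. coli coexpression fold change: {fold_change(0.7, 0.25):.1f}")
```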


3 
1,3-Propanediol 


1,3-PD (1,3-propylene glycol;
1,3-trimethylene glycol) is a three-carbon
alcohol with two hydroxyl groups. It
forms a viscous, colorless liquid that is
soluble in water, alcohols, and ethers [57].
1,3-PD finds many industrial applications, 
including the production of solvents, 
adhesives, laminates, resins, detergents, 
coolants, and cosmetics [57, 58]; its main 
application is, however, the synthesis 
of polymers, especially polyesters, 
polyethers, and polyurethanes [59]. 

1,3-PD can be obtained through 
physico-chemical means by the hydro- 
formylation of ethylene oxide (Fig. 7) or 
by the conversion of acrolein (Fig. 8). Both 
of these processes utilize nonrenewable 
fossil raw materials and are performed 
under conditions of high temperature and 
pressure. The use of metallic catalysts is 
also required. 


3.1 
Microbial Production of 1,3-PD 


1,3-PD is one of the oldest known products of microbial fermentation, and was first identified in 1881 by August Freund in a mixed culture containing Clostridium pasteurianum grown on a glycerin-rich medium [61]. In 1914, Voisenet described rod-shaped bacteria in spoiled wine that were capable of synthesizing the diol [62], and studies on 1,3-PD fermentation by Enterobacteriaceae continued throughout the first half of the twentieth century [63, 64]. In 1983, the ability of C. pasteurianum to produce 1,3-PD was again confirmed [65].

Fig. 7 Hydroformylation of ethylene oxide [59].

Fig. 8 Conversion of acrolein to 1,3-PD [59, 60].

Only bacteria are known to produce 1,3-PD; natural producers of 1,3-PD belong to the genera Klebsiella [40, 58, 66-68], Clostridium [66, 69-74], Citrobacter [16, 66, 75-78], Pantoea (Enterobacter) [7], Hafnia [78], and Lactobacillus [79-81]. Among the above-mentioned genera, strains of the species C. butyricum and K. pneumoniae have been the subjects of the most intensive studies and are considered the best producers of 1,3-PD [59].

The microorganisms with a native ability to produce 1,3-PD can ferment glycerol, or a mixture of glycerol and carbohydrates, and release this diol, but none can convert carbohydrates to 1,3-PD directly [82]. However, genetic engineering enables organisms that are not native 1,3-PD producers, such as E. coli or S. cerevisiae, to synthesize 1,3-PD. Genetically modified yeast such as S. cerevisiae are able to produce 1,3-PD [83], and genetically modified E. coli converts glucose [84-86] or glycerol [87, 88] to 1,3-PD. An E. coli strain engineered by DuPont and Genencor International, Inc., produces 1,3-PD from glucose at titers over 130 g l⁻¹ [89].

Recent developments in the field of microbial 1,3-PD synthesis include studies into its production by mixed cultures that may have either a defined composition or form undefined consortia. During processes that utilize mixed cultures, a two-way relationship exists between microbial composition and environmental conditions. The presence of microbes in the medium influences the latter's physico-chemical properties, which in turn may simultaneously control the qualitative and quantitative composition of the population.

While possibly more difficult to control, a mixed culture offers many advantages over a monoculture in a production environment, mainly because of the ability to adapt to unstable conditions and varying substrate quality [90, 91]. It has also often been pointed out that this approach is preferable for 1,3-PD production under nonsterile conditions [20, 90, 91]. Furthermore, the correct selection of culture member species may allow the simultaneous removal of byproducts (i.e., toxic organic acids) by one of the member species, thus diminishing the limiting influence of these byproducts [92]. In other cases, mixed cultures can be utilized for the combined production of multiple metabolites, or for 1,3-PD production from substrates other than glycerol where one of the member strains converts another substrate to glycerol [26]. In order to maximize the effects of such combinations, and allow control over the process, metabolic pathways (and their regulation) must be well described in all organisms that comprise the mixed culture.

A different approach has also been shown [15], namely a two-step process in which a 1,3-PD producer and a strain capable of assimilating the byproducts are cultured consecutively. As the bacteria are cultured separately, a more robust control over the process should be possible.


3.2 
Glycerol Metabolism in Microorganisms 
Producing 1,3-PD 


The metabolism of glycerol in microorganisms that produce 1,3-PD is divided between pathways forming two major branches, oxidative and reductive. The oxidative branch starts with the dehydrogenation of glycerol to dihydroxyacetone (DHA) by GDH, after which DHA is converted to DHAP by the action of dihydroxyacetone kinase. Through this branch, glycerol assimilation results in the production of organic acids and alcohols. The oxidoreductive nature of these reactions leads to the formation of an excess of NADH2. In order to maintain the cell redox balance, NAD must be recovered, hence the occurrence of the reductive branch, which consists of two reactions. First, the dehydration of glycerol to 3-hydroxypropionaldehyde (3-HPA) is catalyzed by glycerol dehydratase (GDHt), after which 3-HPA undergoes reduction to 1,3-PD with the use of reducing power from NADH2. This latter reaction is catalyzed by 1,3-propanediol oxidoreductase [59]. Thus, the formation of 1,3-PD can be considered as a means of reoxidizing the reducing equivalents that are generated in the oxidative branch. Glycerol metabolism in the best producers of 1,3-PD (K. pneumoniae and C. butyricum) is shown schematically in Fig. 9. The qualitative and quantitative spectrum of the byproducts formed in the oxidative branch is dependent on the microorganism used in the process and the cultivation conditions utilized. The most important environmental factors influencing the process of 1,3-PD production are elaborated in the following sections.
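
The redox argument in this section can be made quantitative with a simple balance. The following is a minimal sketch under stated assumptions, not a model from the cited literature: each glycerol entering the reductive branch consumes one NADH, each glycerol entering the oxidative branch yields a net n NADH (n depends on the byproduct formed), and biomass formation and hydrogen evolution are ignored. Balancing consumption against production, x = (1 - x) * n, gives x = n / (1 + n) for the fraction of glycerol that can be reduced to 1,3-PD. The per-byproduct values of n below are illustrative assumptions, and the function name is hypothetical.

```python
def pd_fraction_at_redox_balance(net_nadh_per_glycerol):
    """Fraction x of glycerol routed to 1,3-PD so that NADH consumed in the
    reductive branch (1 per glycerol) equals NADH produced in the oxidative
    branch (n per glycerol): x = n / (1 + n)."""
    n = net_nadh_per_glycerol
    if n < 0:
        raise ValueError("net NADH yield must be non-negative")
    return n / (1.0 + n)


# Illustrative net NADH yields per glycerol oxidized (assumptions, not from the text).
ASSUMED_NET_NADH = {"acetate": 2, "lactate": 1, "ethanol": 0}

if __name__ == "__main__":
    for byproduct, n in ASSUMED_NET_NADH.items():
        x = pd_fraction_at_redox_balance(n)
        print(f"{byproduct} as sole byproduct: up to {x:.2f} mol 1,3-PD per mol glycerol")
```

Under these simplifications, byproducts whose formation itself consumes NADH leave less reducing power for the reductive branch, which is one way to read the dependence of 1,3-PD yield on the byproduct spectrum.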
Fig. 9 Glycerol metabolism in the best producers of 1,3-PD.


3.3 

Influence of Medium Composition on the 
Microbial Production of 1,3-PD from 
Glycerol 


3.3.1 Glycerol

The formation of 1,3-PD depends mainly on the availability of glycerol in the fermentation medium. However, glycerol can also be a limiting factor when present in the medium at high concentration. Barbirato et al. [66] noted that increasing the initial glycerol concentration from 20 to 70 g l⁻¹ forced a strong modification of the distribution of the carbon fluxes in the metabolic pathways. A lower production of 1,3-PD and a higher production of lactate and ethanol by K. pneumoniae and C. freundii were observed when a higher initial glycerol concentration was used. Similarly, Zhang et al. [93] determined an optimal glycerol concentration of 20 g l⁻¹ for K. pneumoniae XJ-Li, while Jalasutram and Jetty [94] reported better product yields with 20 and 30 g l⁻¹ than with 90 g l⁻¹ of glycerol in cultures of K. pneumoniae; moreover, productivity was observed to increase up to 40 g l⁻¹ of glycerol, but to decrease above that value [94]. The inactivation of GDHt at high glycerol concentrations was a likely cause of this effect [95]. In cultures of Pantoea agglomerans, the amount of byproducts formed (lactate and ethanol) was significantly lower during fermentation on media containing higher concentrations of glycerol, which resulted in a higher 1,3-PD conversion yield [66]. Substrate tolerance is, however, a strain-dependent feature. K. pneumoniae DSM 2026 used by Homann et al. [75] was able to ferment 12% glycerol, while C. freundii grew on media containing up to 10% glycerol.

In the case of C. butyricum, there were no significant changes in the end-product profile in cultures containing various substrate concentrations [66]. Biebl [96] showed that when the glycerol concentration was higher than 60 g l⁻¹ in a C. pasteurianum culture, the conversion was slower and glycerol remained in the broth.

As pure glycerol is a costly substrate, the use of biodiesel-derived raw glycerol as a carbon source has been investigated [68, 73, 97-99]. Glycerol is the major byproduct of biodiesel production from soybean oil or animal fats. In fact, for every 10 kg of biodiesel produced, 1 kg of glycerol becomes available, and this leads to an accumulation of raw glycerol. This crude glycerin is treated as a typical "industrial wastewater" [100] as it contains significant amounts of impurities (e.g., methanol, salts, or free fatty acids), and cannot be utilized in the pharmaceutical or chemical industries without purification. Consequently, raw glycerol can be used as an economical and abundant renewable feedstock for microbiology [68].

Chatzifragkou et al. [101] showed that there was no significant difference in 1,3-PD production by C. butyricum on pure or raw glycerol, although increasing the glycerol concentration to 80 g l⁻¹ caused a prolongation of fermentation time and a decrease in biomass formation. Moreover, an elevation of lactic acid production was observed, which made this acid predominant among the coproducts [101]. Mu et al. [98] reported that the 1,3-PD concentration obtained by K. pneumoniae using glycerol from lipase-catalyzed hydrolysis was higher than that obtained using glycerol derived from an alkali-catalyzed reaction, but both were lower than that obtained from pure glycerol; moreover, the end-product profile was changed. Ethanol was the major byproduct when K. pneumoniae was cultivated on pure glycerol or on glycerol derived from alkali-catalyzed hydrolysis. When fermentation was carried out on glycerol derived from lipase-catalyzed hydrolysis, acetic acid was the main byproduct, although the lactic acid concentration was not determined [98]. Moon et al. [102] also reported that 1,3-PD production was dependent on the type of raw glycerol, with Klebsiella strains being more resistant to all tested types of raw glycerol than the Clostridium strains examined. The production of 1,3-PD using crude glycerol by K. pneumoniae DSM 2026 was shown to be increased compared to that obtained using pure glycerol. Raw glycerol from soybean oil-based biodiesel production generally caused a lesser inhibition of 1,3-PD production compared to raw glycerol from waste vegetable oil-based biodiesel production [102]. Among Clostridia, some strains have been shown to tolerate up to 150 g l⁻¹ crude glycerol [73, 97]. L. diolivorans, a promising candidate as a 1,3-PD producer, is also able to grow and produce 1,3-PD on medium containing crude glycerol; moreover, the amounts of 1,3-PD obtained with this low-quality substrate were comparable to the results of cultures performed with pharmaceutical-grade glycerol [103]. Only a few studies have focused on the production of 1,3-PD by C. freundii from biodiesel-derived raw glycerol [16, 68, 99]; nevertheless, C. freundii is able to produce large amounts of 1,3-PD (up to 66.3 g l⁻¹) from crude glycerol [16].
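
The 10:1 biodiesel-to-glycerol ratio quoted above lends itself to a quick feedstock estimate. The following is a minimal sketch, not a calculation from the cited studies: it takes the 1 kg glycerol per 10 kg biodiesel ratio from the text and combines it with a glycerol purity and a molar 1,3-PD yield that are hypothetical inputs chosen by the user. The molar masses (glycerol 92.09 g/mol, 1,3-PD 76.09 g/mol) are standard values not given in the text, and the function names are illustrative.

```python
M_GLYCEROL = 92.09  # g/mol (standard value, assumed)
M_13PD = 76.09      # g/mol (standard value, assumed)


def glycerol_from_biodiesel(biodiesel_kg, glycerol_per_biodiesel=0.1):
    """Raw glycerol (kg) released per mass of biodiesel; 0.1 reflects the
    1 kg per 10 kg ratio quoted in the text."""
    return biodiesel_kg * glycerol_per_biodiesel


def pd_from_glycerol(raw_glycerol_kg, molar_yield, purity=1.0):
    """1,3-PD (kg) obtainable from raw glycerol of a given purity (mass
    fraction) at a given molar yield (mol 1,3-PD per mol glycerol)."""
    return raw_glycerol_kg * purity * molar_yield * M_13PD / M_GLYCEROL


if __name__ == "__main__":
    raw = glycerol_from_biodiesel(1000.0)  # 1 t of biodiesel
    # Hypothetical inputs: 80% pure crude glycerol, 0.5 mol/mol yield.
    pd = pd_from_glycerol(raw, molar_yield=0.5, purity=0.8)
    print(f"{raw:.0f} kg raw glycerol -> {pd:.1f} kg 1,3-PD")
```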

As mentioned previously, the impuri- 
ties present in raw glycerol can inhibit 
microbial growth and the production of 
1,3-PD, and subsequent investigations fo- 
cused on the purification of such low-grade 
substrates have been reported [99]. The 
treatment of crude glycerol using various 
solvents before fermentation has allowed 
the above-described inhibitions to be re- 
duced. For example, the treatment of 
crude glycerol obtained from linseed and 
jatropha with petroleum ether, of crude 
glycerol obtained from rice bran with hex- 
ane, or of soybean-derived glycerol with 
hexane or petroleum ether, has allowed 
satisfactory results to be obtained [99]. 


3.3.2 Carbohydrates 

Since, in native producers, 1,3-PD is produced only from glycerol, glycerol is not replaced with carbohydrates for the purpose of its production, although carbohydrates may be added to the fermentation medium. These cosubstrates can serve various functions in glycerol metabolism; for example, they may act as acceptors for the reducing equivalents produced during oxidation in organisms unable to grow anaerobically on glycerol. In glycerol-fermenting bacteria that are capable of fermenting glycerol alone to 1,3-PD, cosubstrates can either be fermented independently or can exchange electrons [104]. Glucose is one of the most frequently added cosubstrates; for example, when a glucose-glycerol mixture was present in the fermentation medium for C. butyricum, glucose catabolism was used to produce energy through acetate-butyrate production and the formation of NADH, whereas glycerol was mainly used for the utilization of reducing power and the production of 1,3-PD. As a result, the 1,3-PD yield was increased from 0.57 to 0.92-0.93 mol 1,3-PD per mol of glycerol in comparison to cultures performed using glycerol alone [105]. Biebl and Marten [104] also noted that the characteristic fermentation products of glucose were butyrate and some acetate, whereas the products of glycerol were 1,3-PD, acetate and a small amount of butyrate. When a glucose-glycerol mixture was used for fermentation with C. butyricum, 90% of the glycerol was converted to 1,3-PD, and only 10% to acids. In this mixture, glucose fermentation provided more electrons for the reduction of glycerol as it shifted from butyrate to acetate production [104]. Similar results were obtained by Malaoui and Marczak [106] with C. butyricum E5 which, when cultured on a glucose-glycerol mixture, converted glycerol into 1,3-PD with a yield of 0.89 mol mol⁻¹, compared to 0.63 mol mol⁻¹ when only glycerol was used.
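
To compare the cosubstrate effects reported by different groups, it can help to express each as a relative gain over the glycerol-only culture. The following is a minimal sketch of that arithmetic using the molar yields quoted above (the lower bound of the 0.92-0.93 range is used for [105]); the function name is hypothetical and nothing here comes from the cited studies beyond those four numbers.

```python
def relative_gain(yield_with_cosubstrate, yield_glycerol_only):
    """Fractional increase in molar 1,3-PD yield caused by a cosubstrate."""
    return (yield_with_cosubstrate - yield_glycerol_only) / yield_glycerol_only


# (glycerol only, glucose-glycerol mixture), mol 1,3-PD per mol glycerol, from the text.
REPORTED_YIELDS = {
    "C. butyricum [105]": (0.57, 0.92),
    "C. butyricum E5 [106]": (0.63, 0.89),
}

if __name__ == "__main__":
    for strain, (alone, mixed) in REPORTED_YIELDS.items():
        gain = relative_gain(mixed, alone)
        print(f"{strain}: {alone:.2f} -> {mixed:.2f} mol/mol ({gain:.0%} gain)")
```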

No increase in 1,3-PD yield after the addition of glucose was observed with C. freundii [104], and a similar result was obtained by Metsoviti et al. [68], who found that the replacement of an amount of glycerol with glucose or a glucose-glycerol mixture had no positive effect on the production of 1,3-PD by K. oxytoca. However, Jalasutram and Jetty [94] observed yield values of 0.7 and 0.84, and productivity values of 1.4-1.45 g l⁻¹ h⁻¹, in K. pneumoniae fermentation when only glycerol and a glycerol-glucose mixture were used, respectively. A significant lag in glycerol assimilation was also observed in these cases.

Glucose is a very important carbon 
source in L. diolivorans fermentation, and 
has been shown to be required for biomass 
formation in the production process of 
1,3-PD from glycerol [103]. The cofer- 
mentation of glucose and glycerol using 
L. diolivorans leads to a shift in glucose 
catabolism from NADH-consuming to 
NADH-reducing reactions, whereby acetyl 
phosphate is converted into acetic acid 
rather than ethanol, such that a 2 mol 
NADH excess remains. The same authors 
also noted decreases in the concentrations 
of lactic acid and ethanol after glucose had 
been consumed when the glycerol concentration
in the medium exceeded 10 g l⁻¹
[107]. Moreover, the concentrations of lactic
acid and ethanol increased with an
increasing glucose:glycerol ratio. Similar
observations on the influence of the
glucose:glycerol ratio on 1,3-PD production
were reported by Hiremath et al. [108] for
K. pneumoniae.

When sucrose was investigated as a cosubstrate
in 1,3-PD production by K. pneumoniae
[94], its addition allowed a higher
1,3-PD yield to be obtained than when glucose
was added (0.98 and 0.84 mol PD per
mol glycerol, respectively). The reason for
this may relate to the presence of invertase 
in the medium, as the hydrolysis of sucrose 
to glucose and fructose helped to provide 
energy for cell growth, such that the entire 
glycerol content could be converted into 
1,3-PD [94]. Substantial improvements in 
the final concentration of 1,3-PD, the con- 
version yield and productivity were also 
reported by Yang et al. [109] when sucrose
was added to the fermentation medium of
K. oxytoca mutant cultures.

Mixtures of fructose, xylose and arabi- 
nose with glycerol were also monitored 
for 1,3-PD production by L. diolivorans 
[103]. In this case, a fructose-glycerol
mixture resulted in a higher 1,3-PD concentration
than did a glucose-glycerol
mixture, but a xylose-glycerol mixture
provided worse results and no 1,3-PD was
produced during arabinose-glycerol cultivation.
The lower 1,3-PD production from
a xylose-glycerol mixture may have been
due to a lower NADH production from xylose
compared to glucose. Although xylose
and arabinose are catabolized via the same 
metabolic pathway, differences exist in the 
enzyme kinetics of the first steps of the 
production of pyruvate, and this may be 
a reason for the decreased availability of 
NADH [103]. 

The influence of glucose, sucrose, maltose,
xylose and starch on the synthesis of
key enzymes of the 1,3-PD pathway has
been assessed in K. pneumoniae [110]. Typically,
1,3-PD oxidoreductase was expressed
when maltose was present, whereas nei- 
ther glucose, sucrose, starch nor xylose as 
carbon sources triggered the expression 
of this enzyme. The above-mentioned car- 
bon sources also had very little effect on 
the synthesis of GDHt [110]. 


3.3.3 Diols

1,3-PD is often used in media to moni- 
tor producer strains for product tolerance 
[111] or to isolate microorganisms with 
a high innate tolerance towards 1,3-PD 
[97]. Although 1,2-PD and 1,2-ethanediol 
were each assessed as cosubstrates, both 
were converted to alcohols (i.e., more re- 
duced products) such that the 1,3-PD yield 
was diminished. Consequently, 1,2-PD 
and 1,2-ethanediol are both unsuitable as 
cosubstrates for 1,3-PD production [104]. 


3.3.4 Nitrogen Source 

The influence of both organic nitrogen 
sources (e.g., yeast extract, beef extract,
tryptone, peptone, malt extract, urea, soy- 
bean meal) and inorganic nitrogen sources 
(e.g., ammonium chloride, ammonium 
sulfate, potassium nitrate, ammonium ni- 
trate, sodium nitrate) on 1,3-PD produc- 
tion was investigated [94, 110]. Among 
inorganic nitrogen sources, ammonium 
chloride [94, 110] and ammonium sul- 
fate [93] seemed to be the most suitable 
for 1,3-PD production. However, complex 
organic nitrogen sources were shown to 
serve as sources of both nitrogen and en- 
ergy, and hence to be more suitable for 
1,3-PD production than inorganic nitrogen 
sources. Although yeast extract was found 
to provide the optimal nitrogen source, its 
addition above an optimal concentration 
led to a decrease in 1,3-PD production [94]. 
Recently, the addition of corn steep liquor 
instead of yeast extract was investigated as 
a nitrogen source for 1,3-PD production 
[112]. 


3.3.5 Organic Acids 

Supplementation of the fermentation 
medium with organic acids such as cit- 
ric, succinic and fumaric acids was also 
investigated [113, 114]. Such additions not 
only improved both cell growth and 1,3-PD 
production but also significantly reduced 
the production of lactic acid, succinic acid 
and ethanol, allowing a better availability 
of NADH for the formation of 1,3-PD [113].


3.3.6 Vitamins 

As some natural 1,3-PD producers (e.g.,
K. pneumoniae, C. freundii, and C. pasteurianum)
require vitamin B12 as a coenzyme
for GDHt, the influence of this vitamin
on 1,3-PD production was studied. Pflügl
et al. [107] observed positive results when
the vitamin was added to cultures of L.
diolivorans, with a higher rate of formation
of 1,3-PD and a higher concentration obtained
during glycerol fermentation. However,
Jalasutram and Jetty [94] noted that
for K. pneumoniae, maximum production of
1,3-PD and the highest activities of GDHt
and 1,3-PD oxidoreductase occurred in the
absence of vitamin B12. The latter
effect may have been caused by the vitamin
B12-independence of GDHt of K. pneumoniae,
or by its ability to synthesize vitamin B12
natively.

When Himmi et al. [71] attempted 
to substitute yeast extract with biotin 
as growth factor, C. butyricum was able 
to convert large amounts of glycerol to 
1,3-PD in a medium containing only 
biotin. However, a significant elongation 
of the lag phase was observed compared 
to fermentation in a medium containing 
yeast extract. 

Attempts were also made to replace 
yeast extract with riboflavin and nicotinic 
acid [107], but neither vitamin supported 
sufficient biomass growth and 1,3-PD production.
The substitution of yeast extract
with a combination of three vitamins
(riboflavin, nicotinic acid, and vitamin B12)
allowed a final 1,3-PD concentration that 
was comparable to that obtained with yeast 
extract. These examples indicate that an 
expensive medium component, namely 
yeast extract, may possibly be omitted from 
the culture medium. 


3.4 

Influence of Process Parameters on the 
Microbial Production of 1,3-PD from 
Glycerol 


3.4.1 pH

Whilst the majority of 1,3-PD fermenta- 
tions are performed at constant pH values 
close to neutral, Barbirato et al. [66] showed
that, within the pH range of 6-8, the higher
the pH value, the later in the time course
Enterobacter agglomerans cultures ceased
to function. These authors explained such
behavior as being related to a decreased 
accumulation of 3-HPA in a more basic 
environment. The problem of 3-HPA ac- 
cumulation does not appear to occur in 
Clostridia, but recent findings have sug- 
gested that the introduction of forced 
fluctuations of pH can have beneficial ef- 
fects on glycerol fermentation processes 
with K. pneumoniae [28, 112, 115, 116]. It
has also been shown that, under such con- 
ditions, metabolism is directed towards 
diol production with a resultant decrease 
in organic acid formation, and increases 
in 1,3-PD yield and process productivity. 

When Biebl [96] analyzed continuous 
cultures of C. pasteurianum to monitor the 
influence of pH on the fermentation pro- 
cess, 1,3-PD formation was shown to be 
independent of the acidity of the environ- 
ment. In fact, among other metabolites 
generated only the production of ethanol 
was related to pH. The observed varia- 
tions were also seen not to have resulted 
from any defined changes in environmen- 
tal conditions, which in turn suggested 
that for C. pasteurianum the regulatory 
mechanisms are largely independent of 
the environment. 


3.4.2 Oxygen Supply

As noted above, the conversion of glycerol 
to 1,3-PD is the result of an anaero- 
bic metabolism that, in Clostridia, occurs 
only under strictly anaerobic environmen- 
tal conditions. Recently, however, Leja 
et al. [117] isolated and characterized a 
C. bifermentans strain that was capable 
of assimilating glycerol and producing 
1,3-PD under both microaerophilic and 
strictly anoxic conditions. However, no
reports have yet directly assessed the
influence of dissolved oxygen levels
on the production of 1,3-PD
from glycerol by Clostridia. It has been
shown that clostridial GDHt, a key enzyme
of the 1,3-PD pathway, is extremely
oxygen-sensitive [118]. Furthermore,
Chatzifragkou et al. [101] reported that
significant differences existed between
processes carried out under an atmosphere of
nitrogen and under self-generated anaerobiosis.
Under the latter conditions,
the production of 1,3-PD was hampered,
while a high lactate dehydrogenase (LDH)
activity caused the formation of large
quantities of lactic acid, this effect being
especially pronounced in small-volume
bioreactors.

In contrast, the influence of dissolved 
oxygen on 1,3-PD production was studied 
extensively in Enterobacteriaceae, which 
are facultative anaerobes. Although early 
reports exist [95, 119] on the occurrence 
of GDHt in anaerobic cultures of K. 
pneumoniae, it has been shown that a
limited oxygen supply can have a positive 
effect on the formation of 1,3-PD [112, 
120, 121]. This can be explained by an 
increased biomass yield in microaerated 
cultures, as 1,3-PD is a primary metabolite 
and productivity is biomass-related. The 
oxygen dosing must also be precise, 
since Klebsiella cells in an oxygen-rich 
environment have been shown to shift 
towards acetate and ethanol production 
[112]. 

A recent report was also made concern- 
ing the influence of oxygen on 1,3-PD 
production by L. diolivorans [80], although 
in this case no positive effects of oxygen 
gassing were observed. 


3.5 
Genetic Engineering of 1,3-PD 


The production of 1,3-PD has been extensively
studied in bacteria, with natural
producers of 1,3-PD from glycerol being
either anaerobes (e.g., Clostridium sp.,
Lactobacillus sp.) or facultative anaerobes (e.g.,
Klebsiella sp., Citrobacter sp.). In terms of
bioprocess handling, it might be easier to
consider facultative anaerobes, although
these strains are all classified as
“opportunistic” pathogens and consequently
special safety precautions are required for
their growth [59]. Among these organisms,
the nonpathogenic C. butyricum and
pathogenic K. pneumoniae are considered
the best “natural producers,” and have
attracted more attention because of their
appreciable substrate tolerance, yield, and
productivity [122].

The biochemistry of 1,3-PD production 
has been elucidated in detail on the basis of 
the metabolic pathways and enzyme kinet- 
ics involved. Specifically, 1,3-PD is synthe- 
sized in the glycerol dissimilation pathway 
so that, in Nature, it is not produced 
via other metabolic conversions [123]. 
The enzymes associated with glycerol
metabolism are GDHt, 1,3-propanediol
oxidoreductase (PDOR), GDH, and
dihydroxyacetone kinase (DHAK).
Glycerol dissimilation involves two parallel
pathways, namely reductive and oxidative.
The reductive pathway is carried
out in two enzymatic steps, in the first
of which vitamin B12-dependent GDHt
removes a water molecule from glycerol
to form 3-HPA, which is then reduced to
1,3-PD by a second enzyme, NADH-linked
PDOR. In contrast, in the oxidative pathway
glycerol is dehydrogenated to DHA by
an NAD⁺-linked GDH, and DHA is then
phosphorylated to DHAP
by an ATP-dependent DHAK [40].
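
The redox coupling between the two branches can be illustrated with a simple
bookkeeping sketch in Python. It assumes only what is stated above: the
reductive branch consumes one NADH per glycerol (at PDOR), and the first step
of the oxidative branch forms one NADH per glycerol (at GDH). NADH formed or
consumed downstream of DHAP is deliberately ignored, so the result is not a
yield prediction, and all names in the code are hypothetical.

```python
# NADH change per mol glycerol entering each branch.
# Only the steps described in the text are counted (a simplifying assumption).
NADH_PER_GLYCEROL = {
    "reductive": -1,  # 3-HPA -> 1,3-PD by NADH-linked PDOR
    "oxidative": +1,  # glycerol -> DHA by NAD+-linked GDH
}


def net_nadh(reductive_flux, oxidative_flux):
    """Net NADH formed (mol) for glycerol fluxes (mol) through each branch."""
    if reductive_flux < 0 or oxidative_flux < 0:
        raise ValueError("fluxes must be non-negative")
    return (reductive_flux * NADH_PER_GLYCEROL["reductive"]
            + oxidative_flux * NADH_PER_GLYCEROL["oxidative"])


def is_balanced(reductive_flux, oxidative_flux, tol=1e-9):
    """True if NADH formation and consumption cancel within a tolerance."""
    return abs(net_nadh(reductive_flux, oxidative_flux)) <= tol


print(net_nadh(0.7, 0.3))     # negative: more NADH consumed than formed
print(is_balanced(0.5, 0.5))  # True under the stated simplification
```

A negative result indicates that the reductive branch would require reducing
equivalents from elsewhere (for example, from a cosubstrate or from reactions
downstream of DHAP), which is the role attributed to glucose in Section 3.3.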

An analysis of the robustness of the glyc- 
erol dissimilation pathway at four branch 
points (GDHt, PDOR, GDH, DHAK), per- 
formed by Zhang et al. [124], showed 
that partitioning of the carbon flux between
the reductive and oxidative branches
is robust against environmental conditions.
Such tight control at the glycerol
point is provided by kinetic parameters 
of GDH, GDHt, and 1,3-propanediol de- 
hydrogenase (1,3-PDDH). The rigidity of 
the glycerol branch point implies that an 
improvement in 1,3-PD production by an 
overexpression of the genes involved in the 
glycerol flux partitioning would be a dif- 
ficult task. Similarly, the branch point at 
DHAP, which can enter either the gly- 
colytic or pentose phosphate pathways, 
was shown to be robust against pertur- 
bations and to be largely independent of 
the oxidative branch rate. In contrast, the 
branch points at pyruvate and acetyl-CoA 
were found to be flexible. The metabolism 
of pyruvate and acetyl-CoA also provides 
flexibility for the fine-tuning of the fluxes of
energy and reducing intermediates and
maintaining homeostasis under various 
environmental conditions. Zhang et al. 
suggested that changing the activity of 
enzymes around pyruvate (e.g., pyruvate 
dehydrogenase) and acetyl-CoA would en- 
tail a flux redistribution, which in turn 
would bring about an improvement in 
1,3-PD synthesis [124]. 

The enzymes of the glycerol dissimi- 
lation pathway are encoded by the dha 
operon, which was characterized in K.
pneumoniae [87], C. freundii [125], and C.
butyricum [118]. GDHt is encoded by
three open reading frames (ORFs), namely
dhaB, dhaC, and dhaE (the alternative
nomenclature is dhaB1, dhaB2, dhaB3),
which code for subunits α, β, and γ,
respectively. The dhaT gene codes for
1,3-PDDH; dhaT from K. pneumoniae was
sequenced and its amino acid sequence
and structure were deduced [126]. An
isoenzyme of 1,3-PDDH from E. coli is
encoded by the yqhD gene [85]. DhaR,
a transcriptional activator for the dha
operon, has been reported to induce the
expression of DHAK (dhaK, dhaL, dhaM)
in recombinant E. coli [126], GDH (dhaD)
in C. freundii [127], and 1,3-PDDH (dhaT)
in K. pneumoniae [67]. The genes dhaF
and dhaG code for two subunits of the
reactivating factor (dhaF-dhaG; alternative
nomenclature GdrA-GdrB) of GDHt.
Such reactivating proteins were identified
in C. freundii, K. pneumoniae, and K. oxytoca
[128]. In most wild-type 1,3-PD producers,
expression of the dha regulon is induced 
by glycerol or DHA, and suppressed by 
catabolite repression and exogenous elec- 
tron acceptors such as oxygen, fumarate, 
or nitrate [87, 129, 130]. However, recent 
findings have indicated that 1,3-PD can 
still be obtained under microaerobic or 
mild aerobic conditions [120, 131-134]. 
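
The gene-to-product assignments described above can be collected in a small
lookup table. The Python sketch below is only a convenience for the reader:
the dictionary DHA_GENES and the helper genes_for are hypothetical names, the
entries restate the assignments given in the text, and the pairing of
dhaF/dhaG with GdrA/GdrB follows the order in which the names are listed.

```python
# Gene -> product, as described in the text (alternative names in comments).
DHA_GENES = {
    "dhaB": "GDHt alpha subunit",        # alternative name: dhaB1
    "dhaC": "GDHt beta subunit",         # alternative name: dhaB2
    "dhaE": "GDHt gamma subunit",        # alternative name: dhaB3
    "dhaT": "1,3-PDDH",
    "dhaD": "GDH",
    "dhaK": "DHAK",
    "dhaL": "DHAK",
    "dhaM": "DHAK",
    "dhaF": "GDHt reactivating factor",  # alternative name: GdrA
    "dhaG": "GDHt reactivating factor",  # alternative name: GdrB
    "dhaR": "transcriptional activator of the dha operon",
}


def genes_for(product_keyword):
    """Return the sorted genes whose product description contains the keyword."""
    keyword = product_keyword.lower()
    return sorted(gene for gene, product in DHA_GENES.items()
                  if keyword in product.lower())


print(genes_for("DHAK"))     # genes encoding the kinase
print(genes_for("subunit"))  # structural genes of GDHt
```

Because the match is a plain substring test, a query such as "GDH" also
returns the GDHt entries; a more specific keyword avoids this.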

To tailor the 1,3-PD production pathway
by metabolic engineering, various strategies
have been pursued and applied in
both native producers and heterologous
hosts with an acquired ability to produce
the diol [89]. Multiple studies have been
designed to overcome the current limitations
of the process, such as knock-out of
LDH- or aldehyde dehydrogenase (ALDH)-encoding
genes to eliminate byproduct formation,
the introduction of a heterologous
NAD⁺ regeneration system to favorably
change the redox balance, the utilization
of B12-independent GDHt in recombinant
E. coli, and the overexpression of
dhaT (and dhaD) to restrict any accumulation
of 3-HPA. Owing to the pathogenic
potential of K. pneumoniae and the
difficulties in handling the strict anaerobe
C. butyricum, the applicability of these
organisms in the industrial process is
limited. E. coli and
S. cerevisiae are the heterologous hosts of
choice due to an in-depth knowledge of
these species acquired over past decades,
and a general acceptance of their usage
in industrial processes. Other hosts, such
as nonconventional yeasts or lactobacilli,
should be considered for the heterologous
(or homologous in the case of some of the
latter) production of 1,3-PD, although an
inaccessibility of genetic tools may seriously
limit this direction [89].

Due to the availability of a rich set 
of genetic tools, and its close relationship
to K. pneumoniae, E. coli has become
the most extensively exploited heterologous
system for 1,3-PD production. The
wild-type strain K12 has only a weak capacity
to produce glycerol, and no capacity
to produce 1,3-PD; hence, the production
of 1,3-PD, either from glucose or glycerol,
relies almost entirely on a heterologous
pathway. An E. coli strain engineered by
DuPont and Genencor International, Inc.,
produces 1,3-PD from D-glucose with high
yield and productivity (1,3-PD at titers of
over 130 g l⁻¹) in an aerobic process (series
of patents and applications: [84-86,
135, 136]). The base strain, E. coli K12,
was transformed with the following genes:
glycerol 3-phosphate dehydrogenase and
glycerol 3-phosphate phosphatase from S.
cerevisiae (for the conversion DHA phosphate →
glycerol), and GDHt from K. pneumoniae
(glycerol → 3-HPA). The pathway was completed
by endogenous 1,3-PDDH, encoded
by the yqhD gene. Further improvements
also included the elimination of any unproductive
pathways that would lead to a
loss of consumed carbon. Both glycerol kinase
(glpK) and GDH (gldA) were knocked out
to prevent glycerol from entering the central
carbon metabolism [135].

An interesting approach, demonstrated
by Pflügl et al. [107], involved the development
of an L. diolivorans strain producing
1,3-PD from glycerol. Two major findings
were seen to play a key role in the successful
transformation of this organism: (i) the
absence of a native plasmid, which would
be a major obstacle to the transformation
of L. diolivorans; and (ii) an absence of
DNA methylation. A suitable expression
plasmid, pSHM, for homologous and heterologous
protein expression in L. diolivorans,
was constructed based on the
replication origin repA of L. diolivorans.
The native glyceraldehyde-3-phosphate dehydrogenase
promoter is used for the
constitutive expression of genes of interest.
The functional expression of genes
in L. diolivorans was demonstrated by two
examples. First, the production of green fluorescent
protein resulted in a 40- to 60-fold
higher fluorescence of the obtained clones
compared to the wild-type strain. Second,
the homologous overexpression of a putatively
NADPH-dependent 1,3-PDOR improved
1,3-PD production by 20% in batch
cultures [107].


4 
2,3-Butanediol 


2,3-BD (2,3-butylene glycol;
1,3-dimethylene glycol) is an alcohol
composed of four carbon atoms and
two hydroxyl groups. It is considered a
versatile bulk chemical that has applications
in the manufacture of printing inks,
perfumes, fumigants, moistening and
softening agents, explosives, plasticizers,
and pharmaceutical carriers [18, 137,
138]. In particular, the dehydration of 2,3-BD
into 1,3-butadiene represents a promising 
route for the production of synthetic 
rubber, as the process is independent of 
petroleum [139]. 

2,3-BD can be produced via either chem- 
ical or biotechnological means, although 
due to a gradual exhaustion of crude 
oil reserves interest in its biotechnolog- 
ical production has increased dramati- 
cally in recent years [138, 140]. 2,3-BD 
can be obtained from carbohydrates via
a mixed acid fermentation by different
microorganisms that include (among others)
K. pneumoniae, K. oxytoca, E. aerogenes,
Paenibacillus polymyxa, Bacillus amyloliquefaciens,
and Serratia marcescens [138, 141].
Recently, the fermentative production of
2,3-BD by recombinant E. coli has attracted
much attention [13, 142], as has the use
of engineered S. cerevisiae, B. licheniformis,
or B. subtilis, all of which are
generally regarded as safe (GRAS)
[18, 140, 143-146].


4.1 
2,3-BD Metabolic Pathway 


A variety of monosaccharides (hexoses or 
pentoses) can be fermented to 2,3-BD [8, 
147]. Before generation of the major prod- 
uct and byproducts, the substrate must 
first be converted to pyruvate; if glucose is
the substrate, pyruvate is formed via the
Embden-Meyerhof pathway (glycolysis),
but when pentoses are fermented it is
formed via combined pentose phosphate
and Embden-Meyerhof pathways [148].
Pyruvate is then transformed in a mixed
acid fermentation via several intermediate
compounds that include α-acetolactate,
acetoin, and diacetyl [138]. In addition to
2,3-BD, the other end-products formed 
include organic acids and alcohols (i.e., 
acetate, lactate, formate, succinate, and 
ethanol). The final metabolite profile is 
dependent on the microorganism and con- 
ditions applied [27]. 

A majority of studies concerning 
the fermentative production of 2,3-BD 
have been performed with members 
of the Enterobacteriaceae family. The 
production of 2,3-BD from pyruvate 
in these bacteria involves three key 
enzymes: α-acetolactate synthase (ALS);
α-acetolactate decarboxylase (ALD); and
2,3-butanediol dehydrogenase (BDH, also
known as acetoin reductase). The pyruvate
originating from glycolysis is first
coupled with thiamine pyrophosphate
(TPP) to form acetyl-TPP, after which
this compound is condensed with a
second molecule of pyruvate to yield
α-acetolactate and release one molecule
of CO2. This reaction is catalyzed by ALS
[8, 27, 149]. Under anaerobic conditions,
α-acetolactate can be converted to acetoin
via a reaction catalyzed by ALD. When
oxygen is present, α-acetolactate can
undergo spontaneous decarboxylation,
producing diacetyl as a minor byproduct,
and this in turn can be converted to
acetoin by diacetyl reductase (DAR; also
called acetoin dehydrogenase). Finally, BDH
reduces acetoin to 2,3-BD [8, 138, 150]. In
S. cerevisiae, α-acetolactate can be formed
from pyruvate by Ilv2p (the isoleucine
and valine biosynthetic enzyme, ALS) that
is involved in the isoleucine and valine
synthesis pathway. In contrast to bacteria,
α-acetolactate cannot be decarboxylated
enzymatically into acetoin, as ALD is not 
found in most S. cerevisiae strains. The 
biosynthetic pathway of 2,3-BD formation 
with diacetyl as intermediate is similar 
to that of bacteria [144]. Alternatively, S. 
cerevisiae can synthesize acetoin via the 
condensation of active acetaldehyde with 
acetaldehyde by pyruvate decarboxylase; 
2,3-BD is then formed as a result of 
the reduction of acetoin by BDH [144, 
151]. Unfortunately, the metabolic fluxes 
that lead from pyruvate to 2,3-BD are 
not efficient in wild-type strains of S. 
cerevisiae, and therefore many studies are 
being undertaken to amplify these fluxes 
[18, 144, 145, 151]. 

As the molecule of 2,3-BD contains
two chiral carbons, three stereoisomers
can be distinguished: 2R,3R-butanediol
(D-butanediol), 2S,3S-butanediol
(L-butanediol), and meso-butanediol
[149]. In general, wild-type strains of
microorganisms that produce 2,3-BD
yield a mixture of these forms, though
the ratio of stereoisomers produced 
can vary dramatically depending on the 
host and the fermentation conditions 
[152]. Enantiomerically pure 2,3-BD 
can be used in the production of 
valuable pharmaceuticals and agricultural 
compounds [153], although separation of 
the isomers from the fermentation broth 
is very expensive [137]. Consequently, 
metabolic pathway engineering methods 
are used to develop strains that produce 
pure 2R,3R- or meso-form of 2,3-BD. 
The majority of these attempts have 
been made in E. coli or B. licheniformis 
[146, 152-154]. 
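
The reason why two chiral carbons give three rather than four stereoisomers
can be shown by a short enumeration. The Python sketch below is illustrative
only and its function and variable names are hypothetical. It relies on the
fact that the two halves of 2,3-BD are identical, so the assignments (R,S)
and (S,R) describe the same compound, namely the meso form.

```python
from itertools import product


def distinct_stereoisomers():
    """Enumerate R/S assignments at C2 and C3 of 2,3-butanediol.

    The two halves of the molecule are identical, so (R,S) and (S,R)
    are the same compound; sorting each pair merges them.
    """
    seen = set()
    for c2, c3 in product("RS", repeat=2):
        seen.add(tuple(sorted((c2, c3))))
    return sorted(seen)


NAMES = {
    ("R", "R"): "2R,3R-butanediol",
    ("S", "S"): "2S,3S-butanediol",
    ("R", "S"): "meso-butanediol",
}

for config in distinct_stereoisomers():
    print(config, NAMES[config])
```

Four assignments are generated, but only three distinct compounds remain
after the equivalent pair has been merged.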


4.2 
Fermentation 


An increase in the efficiency of the micro- 
bial synthesis of 2,3-BD can be achieved 
by controlling the metabolism of bacte- 
rial cells, in either of two ways. The first 
approach is to modify the environmen- 
tal conditions of the culture, such as pH, 
temperature, or aeration. The second ap- 
proach is to control the composition of the 
culture medium, which in turn affects the 
enzymatic activity of microorganisms. 


4.2.1 Substrates 

The relatively high cost of conventional 
substrates such as starch or sucrose has 
been identified as a major factor affecting 
the economic viability of 2,3-BD fermenta- 
tion [8]. Therefore, attempts to minimize 
the production costs of 2,3-BD have led to 
the development of fermentative methods 
based on low-cost substrates [7]. Finally,
the production of 2,3-BD from waste products
appears to be an attractive alternative
to traditional feedstock and, although not
yet optimized, it could make the
process more economical and feasible.

To date, mainly renewable agricultural 
resources (starch), food industry residues 
(starch hydrolysates, whey permeate, mo- 
lasses) and lignocellulosic biomass (wood 
and corn cob hydrolysate) have been used 
as substrates for 2,3-BD fermentation. For 
example, Wang et al. [155] reported the use
of corncob molasses to generate 2,3-BD
with K. pneumoniae. Corncob molasses, a
waste byproduct of xylitol production, contains
high concentrations of mixed sugars
that can be used by K. pneumoniae to produce
2,3-BD in the preferential order:
glucose > arabinose > xylose. As a result, in
a fed-batch fermentation 78.9 g l⁻¹ 2,3-BD
was obtained within 61 h, giving a 2,3-BD
productivity of 1.3 g l⁻¹ h⁻¹ and a yield of
81.4%. Unfortunately, however, the high
sugar concentration in corncob molasses
was shown to have an inhibitory effect on
both cell growth and 2,3-BD production. In another
study where molasses was used for
2,3-BD synthesis, Afschar et al. [156] found
that K. oxytoca could ferment this substrate
effectively at high concentrations, with
quantities of molasses up to 280 g l⁻¹ being
converted to 118 g l⁻¹ 2,3-BD, and with
a productivity of 2.4 g l⁻¹ h⁻¹. It should be
noted that very little nutrient supplementation
was required for the conversion of
molasses to 2,3-BD.
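
The productivity figures quoted above follow from dividing the final product
concentration by the fermentation time. The Python sketch below reproduces
this arithmetic for the values reported by Wang et al. [155]; the function
names are hypothetical. The duration derived for the study of Afschar et al.
[156] is an inference made under the assumption that the quoted productivity
is an average over the whole run, and is not a figure given in the text.

```python
def volumetric_productivity(titer_g_per_l, hours):
    """Average volumetric productivity in g l^-1 h^-1."""
    if hours <= 0:
        raise ValueError("fermentation time must be positive")
    return titer_g_per_l / hours


def implied_duration(titer_g_per_l, productivity):
    """Fermentation time (h) implied by a titer and an average productivity."""
    if productivity <= 0:
        raise ValueError("productivity must be positive")
    return titer_g_per_l / productivity


# Wang et al. [155]: 78.9 g/l obtained within 61 h.
wang = volumetric_productivity(78.9, 61)

# Afschar et al. [156]: 118 g/l at 2.4 g/l/h (duration assumed, see lead-in).
afschar_hours = implied_duration(118, 2.4)

print(f"{wang:.2f} g/l/h; about {afschar_hours:.0f} h")
```

The first value agrees with the reported 1.3 g l⁻¹ h⁻¹ after rounding to one
decimal place.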

In another investigation, Perego et al. 
[157] monitored food industry waste for 
2,3-BD production, with substrates in- 
cluding starch hydrolysates derived from 
corn transformation, raw and decol- 
ored molasses from the sugar extrac- 
tion of beets, and whey from cheese 
manufacture. For E. aerogenes NCIMB
10102 strain, the waste starch hydrolysate
was found to be the most promising
raw material, ensuring the highest
yield (0.88 molP molS⁻¹) and volumetric
productivity (0.65 molP l⁻¹ h⁻¹), which was
almost twice that estimated for synthetic
glucose solutions. In contrast, molasses
were consumed more quickly compared
to the other substrates, but the product
yield obtained was unsatisfactorily
low (0.38 molP molS⁻¹). In a subsequent
study, Perego et al. [158] confirmed the
use of waste starch hydrolysate for 2,3-BD
production with B. licheniformis NCIMB
8059 strain. Notably, there was no need
to supplement the growth factors in the
medium for this study, as these were
clearly present in sufficient quantity in
the waste material.

The use of whey permeate for 2,3-BD 
fermentation has been widely investigated. 
This byproduct of the dairy industry, which 
contains a high concentration of lactose, 
has attracted much interest as an alter- 
native substrate for 2,3-BD fermentation, 
in attempts to alleviate problems with 
its disposal. In contrast to starch and 
sugar substrates, whey permeate has un- 
fortunately proven to be a relatively poor 
substrate [159, 160]. Nonetheless, Lee and 
Maddox [161] have applied a cell immobi- 
lization technology to 2,3-BD production 
for K. pneumoniae, using whey permeate
as substrate, and have successfully
achieved a higher 2,3-BD productivity.
Perego et al. [157] also showed that a
pre-hydrolysis process of whey permeate
could considerably increase diol productivity
(to 0.86 molP molS⁻¹).

Another raw material with high poten- 
tial, namely Jerusalem artichoke tubers, 
was used by Sun et al. [162] as a low-cost
substrate to produce 2,3-BD by K. pneumoniae.
During a simultaneous saccharification
and fermentation process (SSF)
in a fed-batch system, a 2,3-BD concentration
of 91.6 g l⁻¹ was achieved in 40 h. In another
study, Li et al. [163] reported that
the addition of Jerusalem artichoke tuber
to a hydrolysate of the stalk increased the
sugar concentration and thus achieved a
higher yield of 2,3-BD. Together, these investigations
represent important progress
towards using alternative, low-cost sub- 
strates for 2,3-BD production [7]. 

Glycerol, a byproduct of biodiesel pro- 
duction, is also a promising low-cost, non- 
cellulosic substrate that can be used as a 
carbon source for conversion to 2,3-BD. 
Although the major product of glycerol 
fermentations is 1,3-PD [100], Biebl et al. 
[58] have proposed that, under microaero- 
bic and low pH conditions, the production 
of 1,3-PD could be reduced and glycerol 
could be converted to 2,3-BD only. In one 
study, Petrov and Petrova [164] investi- 
gated the main parameters influencing 
the fed-batch culture of K. pneumoniae, and
found pH to be the most important factor 
affecting 2,3-BD production from glycerol 
as sole carbon source. Experiments were 
conducted with noncontrolled pH, start- 
ing at different levels of initial pH, and 
allowed the observation of specific pH 
fluctuations which, in turn, increased the 
2,3-BD yield. In a subsequent study, the 
same authors applied a new method of 
“forced pH fluctuation” for the enhance- 
ment of 2,3-BD production using glycerol 
[164]. In this method, consecutive elevations
of pH using a definite ΔpH value
allowed a significant increase in glycerol
utilization, and thus a higher production
of 2,3-BD (namely 70 g l⁻¹).

The use of a low-cost, lignocellulosic 
biomass as an alternative substrate for 
2,3-BD production has received consid- 
erable attention during the past few years 
[8, 138]. Lignocellulose is the most abun- 
dant biomass on Earth, and due to its 
wide availability and renewable nature it 
has attracted considerable attention as an 
alternative feedstock for bioprocesses and 
biochemicals production [165]. The main
components of lignocellulosic biomass are
lignin, cellulose, and hemicellulose. Cel- 
lulose is a polymer of glucose, while hemi- 
cellulose is a polymer containing mostly 
pentoses (including xylose, arabinose, and 
ribose) [166]. Thus, an ability to utilize five-
and six-carbon sugars is required for any
bacterial strain to be used in these fermentation
processes [167].

Corncob is a low-cost, widely available 
agricultural residue derived from corn pro- 
cessing. When its use as a substrate for 
2,3-BD production in a SSF process was in- 
vestigated by Cao et al. [168], it was reported 
that after pretreatment with dilute ammo- 
nia and hydrochloric acid, 90% of the cellu- 
lose in corncob was hydrolyzed to glucose 
and subsequently was fermented to 2,3-BD 
by K. oxytoca strain. A diol concentration 
of 25g1"! and an ethanol concentration 
of 7gl-! were produced from 80g]"! 
corncob cellulose with a cellulase dosage 
of 8.5IFPUg™! corncob. Equally impor- 
tantly, the hemicellulose fraction was pre- 
viously separated from the hydrolysate. In 
order to make the biomass conversion eco- 
nomically feasible, it is essential that the 
hemicellulose fraction also should be ef- 
ficiently converted into 2,3-BD [8]. Cheng 
et al. [169] carried out a detoxification of 
the acid hydrolysate of corncob contain- 
ing a high concentration of hemicellu- 
lose components by sequentially boiling, 
over-liming and adsorbing the hydrolysate 
onto activated charcoal, and then using the 
pentose-rich hydrolysate as a substrate for 
2,3-BD production by K. oxytoca. In this 
way, a maximal 2,3-BD concentration of 
35.7 g l⁻¹ was achieved after a
60 h fed-batch fermentation, with a yield
of 0.5 g g⁻¹ reducing sugar and a productivity
of 0.59 g l⁻¹ h⁻¹. Saha and Bothast
[170], in their study of 2,3-BD synthesis, 
used corn fiber, another hemicellulose-rich
byproduct of corn processing. The acid- 


plus enzyme-saccharified corn fiber was 
fermented by an Enterobacter cloacae strain
such that 2,3-BD was produced with a
yield of 0.35 g g⁻¹ available sugars. The
same strain was able to produce 2,3-BD
from dilute acid-pretreated corn fiber by
SSF, with a yield of 0.34 g g⁻¹ theoretical
sugars. 
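
The concentration, yield, and productivity figures quoted in this section are linked by simple arithmetic. The following minimal sketch shows how such figures might be cross-checked. The function names are hypothetical, and the theoretical ceiling of about 0.50 g g⁻¹ is an assumption derived from a stoichiometry of one mole of 2,3-BD (90.12 g) per mole of glucose (180.16 g); applying it to the mixed sugars of corn fiber is only an approximation.

```python
# Hypothetical helper names; only the numeric inputs come from the text above.

# Assumed stoichiometric ceiling: 1 mol 2,3-BD (90.12 g) per mol glucose (180.16 g).
THEORETICAL_YIELD_G_PER_G = 90.12 / 180.16


def volumetric_productivity(titer_g_per_l, time_h):
    """Average volumetric productivity (g l^-1 h^-1) over the whole run."""
    if time_h <= 0:
        raise ValueError("time_h must be positive")
    return titer_g_per_l / time_h


def fraction_of_theoretical(yield_g_per_g):
    """Mass yield expressed as a fraction of the assumed theoretical maximum."""
    return yield_g_per_g / THEORETICAL_YIELD_G_PER_G


if __name__ == "__main__":
    # Fed-batch on detoxified corncob hydrolysate [169]: 35.7 g/l after 60 h.
    print(f"productivity: {volumetric_productivity(35.7, 60.0):.3f} g/l/h")
    # Corn fiber [170]: 0.35 g 2,3-BD per g available sugars.
    print(f"fraction of theoretical: {fraction_of_theoretical(0.35):.2f}")
```

The first call gives a value close to the productivity reported for the corncob hydrolysate, which suggests that the quoted figure is an average over the full 60 h run.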


4.3 
Fermentation Conditions 


4.3.1 Aeration 

It has been proven that the oxygen supply 
is one of the most important factors in 
2,3-BD fermentation, as it affects produc- 
tivity and also byproduct formation [138, 
171-174]. Notably, bacterial biomass, ac- 
etate and, to a lesser extent, acetoin are 
formed preferably when the supply of oxy- 
gen is high, but when oxygen is limited 
acetoin can be reduced by NADH and thus 
the presence of butanediol is observed. In 
contrast, at a low O2 supply the synthesis of 
lactate and ethanol will predominate. Most 
importantly, this balance can be easily dis- 
torted by the inhibitory effects of some 
byproducts (e.g., acetate, ethanol, and lac- 
tate) on cell growth and 2,3-BD production 
[138, 147]. 

Although butanediol is a product of 
anaerobic metabolism, aeration (precisely 
microaeration) has been shown to increase 
its formation [175-178]. Pirt and Callow 
[178] have suggested that the presence 
of oxygen in a culture medium increases 
butanediol productivity by stimulating bac- 
terial cell growth. Jansen et al. [179] also 
reported that diol synthesis can be max- 
imized by increasing the oxygen supply 
rate, due to a higher cell density devel- 
oping during fermentation, although a 
too-high level of aeration can decrease 
the yield of 2,3-BD. When the O2 sup- 
ply is limited, alternative pathways for 
substrate utilization (namely “respiration”
and “fermentation”) become active simultaneously,
and the yield of 2,3-BD
will depend on the relative activities of 
these pathways. Consequently, the yield 
of 2,3-BD can be maximized by minimiz- 
ing the O2 supply, because this limits the 
respiration. On the other hand, lowering 
the availability of O2 will lead to a decline
in cell density, and the 2,3-BD yield
will be reduced as a result of the direct 
relationship between volumetric 2,3-BD 
productivity and biomass concentration [8, 
40, 171]. 

Two sequential phases are noted dur- 
ing the course of 2,3-BD fermentation 
[179]. In the first phase, the oxygen supply 
in the culture is sufficient to maintain a 
high value of dissolved oxygen (DO), such 
that the biomass will increase exponen- 
tially but no butanediol will be formed. 
Consequently, the demands of the cul- 
ture increase and the DO falls to zero. 
When the oxygen supply becomes limit- 
ing (DO < 3% of saturation), the biomass 
will increase linearly rather than exponen- 
tially, which means that the growth rate 
will decrease. During the second phase, 
metabolism of the carbon source can al- 
ternate as a result of oxygen supply (which 
is controlled by the aeration and agitation 
rates) and the oxygen demand of the cul- 
ture (which is controlled by the biomass 
concentration, the culture pH, and the 
presence or absence of growth inhibitors). 
In light of this, the establishment of a suit- 
able oxygen supply control strategy with 
respect to the demands of the culture ap- 
pears to be necessary to ensure efficient 
2,3-BD production [8]. However, it has 
been found that the parameters for 2,3-BD 
synthesis are case-specific and need to be 
determined for a particular bacterial strain, 
media composition, and fermenter design 
and operation. 

Beronio and Tsao [180], in a 2,3-BD
fermentation with K. oxytoca, controlled 
only one parameter — namely the oxygen 
transfer rate (OTR) — to maintain growth 
rate and specific oxygen uptake rate. When 
a particular OTR was maintained to keep 
the culture at a constant level of oxygen 
limitation, the final 2,3-BD concentration 
was similar to that of the experiment
without OTR control, but diol productivity 
was 18% higher in the OTR-controlled 
fermentation. 

Another parameter, namely the volumetric
oxygen transfer coefficient (kLa),
directly governs the OTR and was stud- 
ied to explore the characteristics of 
(R,R)-2,3-BD production by Paenibacillus 
polymyxa in response to various oxygen 
supply conditions [181]. For this purpose, 
the programmed kLa change method was
adapted, and three different parameter
levels (40 h⁻¹ at 0-19 h, 21 h⁻¹ at 19-41 h,
and 8 h⁻¹ at 41-55 h) were set up to
control the flow of substrate being ox- 
idized through the aerobic respiratory 
and the anaerobic fermentative pathways. 
Ultimately, this method allowed 44 g l⁻¹
(R,R)-2,3-BD to be obtained with a productivity
of 0.79 g l⁻¹ h⁻¹.
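
The relationship between kLa and the OTR, together with the stepwise kLa programme described above, can be sketched as follows. This is a minimal illustration rather than the control scheme of the cited study: the function names are hypothetical, the standard film-model expression OTR = kLa (C* − CL) is assumed, and the saturation concentration used in the example (0.2 mmol l⁻¹) is an illustrative placeholder, not a value from the text. Assigning a boundary time such as 19 h to the later step is likewise an assumption.

```python
from bisect import bisect_right

# Step programme quoted from [181]: (start time in h, kLa in h^-1).
KLA_PROGRAMME = [(0.0, 40.0), (19.0, 21.0), (41.0, 8.0)]
END_TIME_H = 55.0


def programmed_kla(t_h):
    """Return the kLa set point (h^-1) scheduled for culture time t_h."""
    if not 0.0 <= t_h <= END_TIME_H:
        raise ValueError("t_h lies outside the 0-55 h programme")
    starts = [start for start, _ in KLA_PROGRAMME]
    # bisect_right assigns a boundary time (e.g. 19 h) to the later step.
    return KLA_PROGRAMME[bisect_right(starts, t_h) - 1][1]


def oxygen_transfer_rate(kla_per_h, c_sat, c_liquid):
    """Film-model OTR = kLa * (C* - C_L), in concentration units per hour."""
    return kla_per_h * (c_sat - c_liquid)


if __name__ == "__main__":
    C_SAT = 0.2  # mmol l^-1, illustrative placeholder (assumption)
    for t in (5.0, 30.0, 50.0):
        kla = programmed_kla(t)
        # C_L = 0 approximates a fully oxygen-limited culture.
        otr = oxygen_transfer_rate(kla, C_SAT, 0.0)
        print(f"t = {t:4.1f} h  kLa = {kla:4.1f} 1/h  max OTR = {otr:.1f} mmol/l/h")
```

Because the OTR scales linearly with kLa at a fixed driving force, stepping kLa down in this way lowers the oxygen available to the culture in proportion, which is how the programme shifts carbon flow from respiration towards fermentation.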

The influence of oxygen supply and 
dilution rate (D) on the production of 
2,3-BD by E. aerogenes was studied in 
a continuous culture reported by Zeng 
et al. [182]. In this case, different oxygen 
uptake rates (OURs) were generated by 
altering the speed of the impeller with 
constant aeration; the rates were subse- 
quently determined from measurements 
of the oxygen and carbon dioxide contents 
in the effluent gas of the culture. The 
results showed that the optimal OUR in- 
creased with the dilution rate, but as the 
dilution rate increased the yield and prod- 
uct concentration decreased. In this re- 
port, a maximum diol concentration up to 


43 g l⁻¹ was obtained at D = 0.1 h⁻¹, while
the volumetric productivity was highest
at D = 0.28 h⁻¹ (5.6 g l⁻¹ h⁻¹). It is worth
noting that product generation was not 
dependent on growth rate, and even at 
low growth rates a higher specific pro- 
ductivity could be expected. Hence, high 
productivities may be achievable by em- 
ploying cell recycling and cell immobiliza- 
tion systems. 
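
For a chemostat at steady state, the volumetric productivity is the product of the dilution rate and the product concentration in the effluent. The short sketch below applies this standard relation to the figures quoted above. The function names are hypothetical; the maximum concentration is taken as 43 g l⁻¹, and the concentration at D = 0.28 h⁻¹ is not reported in the text but is back-calculated here from the quoted productivity, on the assumption that both figures refer to the same product.

```python
def steady_state_productivity(dilution_rate_per_h, product_g_per_l):
    """Volumetric productivity (g l^-1 h^-1) of a chemostat: D * P."""
    return dilution_rate_per_h * product_g_per_l


def implied_concentration(productivity_g_per_l_h, dilution_rate_per_h):
    """Product concentration (g l^-1) implied by a productivity at a given D."""
    if dilution_rate_per_h <= 0:
        raise ValueError("dilution_rate_per_h must be positive")
    return productivity_g_per_l_h / dilution_rate_per_h


if __name__ == "__main__":
    # Maximum concentration, quoted at D = 0.1 h^-1.
    low_d = steady_state_productivity(0.1, 43.0)
    # Maximum productivity, quoted at D = 0.28 h^-1.
    high_d_conc = implied_concentration(5.6, 0.28)
    print(f"D = 0.10 1/h: productivity = {low_d:.1f} g/l/h at 43 g/l")
    print(f"D = 0.28 1/h: productivity = 5.6 g/l/h at about {high_d_conc:.0f} g/l")
```

On these figures, raising D from 0.1 to 0.28 h⁻¹ increases productivity from about 4.3 to 5.6 g l⁻¹ h⁻¹, while the implied product concentration falls from 43 to about 20 g l⁻¹, which illustrates the trade-off between productivity and final concentration.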

The OUR was also successfully adapted 
as a parameter for the control and devel- 
opment of a two-stage continuous process 
for 2,3-BD production with K. pneumoniae
[183]. In the first stage, bacterial
cells were grown at a high OUR level
(65 mmol l⁻¹ h⁻¹), while in the second
stage fermentation was carried out at a low
OUR (10 mmol l⁻¹ h⁻¹). Under such conditions,
180 g l⁻¹ glucose was converted to
77 g l⁻¹ 2,3-BD (+4.4 g l⁻¹ acetoin) with a
high productivity of about 2.3 g l⁻¹ h⁻¹.

Converti et al. [177] have carried out 
batch system fermentations to evaluate 
and quantify the effect of OUR on glucose 
consumption and product formation 
by E. aerogenes NCIMB 10102. As a 
result, an increase in the percentage
of glucose directed to cell growth
from 4.5% to 59.4% with increasing
QO₂ up to 72.7 mmol O₂ C-mol DW⁻¹ h⁻¹
was observed. In contrast, 2,3-BD
synthesis was progressively enhanced,
passing from anaerobic to microaerobic
conditions, and achieved a maximum
yield (0.69 mol mol⁻¹ glucose) at a specific OUR
of only 46.1 mmol O₂ C-mol DW⁻¹ h⁻¹,
after which it rapidly decreased.
Concurrently, acetoin production was
enhanced significantly on increasing QO₂
to 59.6 mmol O₂ C-mol DW⁻¹ h⁻¹, while
ethanol formation was strongly decreased.
This may have been the result of a
shortage of NADH₂⁺ due to its
O₂-based removal.


In some cases, oxygen can also 
affect the optical purity of the 2,3-BD 
isomers produced [176]. In the chemostat 
culture system with P. polymyxa at
pH 6.3 and a 0.1 h⁻¹ dilution rate, the
optimum air supply for 2,3-BD
production was 200 ml min⁻¹, at
which the OUR was 6.7 mmol l⁻¹ h⁻¹
(2.6 mmol g⁻¹ dry cell h⁻¹ as QO₂).
However, under these conditions the
optical purity of (R,R)-2,3-BD was
decreased from 98% to 93% in comparison
with the anaerobic culture. Further
increases in the OUR reduced the
optical purity to 88%. An additional
experiment was conducted to test the 
possibility of meso-2,3-BD formation
from diacetyl formed spontaneously by
the decarboxylation of α-acetolactate
in the presence of O₂. For this purpose,
10 mM diacetyl was added to the culture
medium under anaerobic conditions,
and although metabolite production
was much reduced (diacetyl is a strong
growth inhibitor of P. polymyxa),
the concentration of the meso form of the diol
was increased and this was noted as a
major stereoisomer of 2,3-BD present in
the culture medium.


4.3.2 pH

The culture pH value has a considerable 
influence on the course of 2,3-BD fermen- 
tation, due to its participation in regu- 
lating bacterial metabolism [147, 171]. In 
general, alkaline conditions promote the 
formation of organic acids, with a simul- 
taneous decline in 2,3-BD synthesis. On 
the other hand, under acidic conditions, 
organic acid production is reduced (over 
10-fold) and diol synthesis is increased (by 
3- to 7-fold). However, the optimum pH 
for 2,3-BD production is heavily depen- 
dent on the microorganism and substrate 
used [138, 184]. 

Most bacteria can conduct fermentation
processes, and in effect can generate
a number of soluble and gaseous
products, including organic acids [185]. 
However, an increase in acid formation 
during the course of cultivation will lead 
to medium acidification such that, ulti- 
mately, the culture will be inactivated by 
its own products [58]. In light of this, the 
secretion of 2,3-BD into the fermentation 
broth seems to be one of the strategies 
which microbes evolved as an adaptive 
response to pH changes in the envi- 
ronment. Maddox [186] suggested that 
2,3-BD pathway induction is caused by 
an accumulation of acidic products in the 
medium, rather than by altering the in- 
ternal pH. The resulting transmembrane 
pH gradient causes an accumulation of 
acetate, which would induce the enzymes 
involved in 2,3-BD synthesis. Hence, low- 
ering the culture pH would cause an 
increase in the pH gradient, and 2,3-BD 
production would occur before the external
pH became too low and the culture
was inactivated.

Biebl et al. [58] observed the effect of
pH on 2,3-BD synthesis in a K. pneumoniae
strain under conditions of uncontrolled
pH. In this case, the synthesis of 2,3-BD
started after a delay of some hours and was
connected with reuse of the acetate that
was formed during the first period. In
continuous cultures in which the pH was 
lowered stepwise from 7.3 to 5.4, diol 
formation started at pH 6.6 and reached 
a maximum yield at pH 5.5, whereas the 
formation of acetate and ethanol declined 
in this range. The production of 2,3-BD 
was also observed in chemostat cultures 
grown at pH 7 under conditions of glycerol 
excess, but only with low yields. At any 
of the pH values tested, excess glycerol 
in the culture enhanced the butanediol 
yield. 

Petrov and Petrova [164] also investigated
the effect of pH on 2,3-BD fermentation
by a K. pneumoniae strain. The
experimental fermentations were carried 
out with noncontrolled pH, but started 
at various initial levels of this parameter. 
Thus, spontaneous changes of pH and 
product formation were observed. In the 
absence of external maintenance, the mi- 
croorganism attempted to control the pH 
by using acetate/2,3-BD alternations of the 
oxidative pathway of substrate catabolism, 
but this resulted in pH fluctuations ob- 
served during the course of fermentation. 
Therefore, the bacterial culture secreted
2,3-BD in unequal portions, either allowing
or suppressing acetate synthesis. In
contrast, when the pH was controlled at a
constant level the yield of 2,3-BD was very
poor. Moreover, whereas the pH-controlled cultures
remained viable for only 72 h, the pH
self-controlling cells lived and produced
2,3-BD for up to 280 h.

In their next study, Petrov and Petrova 
[116] investigated a new process for 2,3-BD 
production from glycerol by K. pneumoniae.
An improvement in productivity over the
fed-batch system was observed by using a
new method of “forced pH fluctuation.” This
involved consecutively raising
the pH by definite ΔpH values at exact
time intervals, thereby provoking multiple
pH variations. Consequently, 70 g l⁻¹ 2,3-BD
was produced, in contrast to 52.5 g l⁻¹
2,3-BD achieved in a traditional, fed-batch
fermentation without pH control. The 
forced pH fluctuations emphasized the 
significant role of pH as a governing 
factor in microbial 2,3-BD production 
processes. 

Wong et al. [187] examined the effects of 
pH control and fermentation strategies
on 2,3-BD production by an isolated 
Klebsiella sp. Zmd30 strain. In this case, 
a pH value of 6.0 was found to be 


optimal among the investigated range of 
4.5 to 9.0. In batch fermentation, the 
concentration, productivity, and yield of 
2,3-BD were 57.17 g l⁻¹, 1.59 g l⁻¹ h⁻¹, and
82%, respectively. 

2,3-BD fermentations using another mi- 
croorganism, namely B. amyloliquefaciens, 
have also been reported [143]. The effects 
of the initial pH (4.5-8.0) of the culture 
medium on glucose utilization and 2,3-BD 
synthesis were investigated in this study. 
Whilst maximum diol production was reported
at pH 6.5, 2,3-BD production was
almost constant within a wide range of 
pH, from 5.5 to 7.5. However, production 
was decreased significantly when the pH 
was outside this range, especially under 
acid conditions (pH <5.5). 

Yang et al. [188] subsequently investi- 
gated the time profiles of pH changes 
during glycerol batch fermentation by B. 
amyloliquefaciens supplemented with glu- 
cose or sucrose, and noted that acetate 
was synthesized more rapidly than 2,3-BD 
in both batches. The rising acid content 
caused an initial drop in pH, but sub- 
sequent 2,3-BD formation reversed the 
intracellular acidification and raised the 
pH to 7.3. Consecutive and gradual pH 
rises and declines were observed as fluctu- 
ations in the pH-time profile. 

For another Bacillus member, P. 
polymyxa, Nakashimada et al. [176] 
reported a maximum 2,3-BD production 
from glucose at pH between 5.7 and 6.3 in 
a chemostat system. B. licheniformis also 
showed a similar optimum pH value for 
the production of 2,3-BD [189]. However, 
diol formation was greatest in the pH 
range of 5.0-5.2 when pentose (xylose) 
was used as the carbon source [190]. 

When Zeng et al. [182] experimentally 
determined the optimum pH value for 
the production of 2,3-BD in a microaer- 
obic continuous culture of E. aerogenes, 


they observed that biomass concentrations 
increased with pH ranging from 5.0 to 
7.0, as the specific ATP requirement of 
the cells decreased. By contrast, in the 
pH range 5.5-6.5 the product concentra- 
tion (2,3-BD+acetoin) was maximal and 
near-constant, although specific produc- 
tion declined continuously with increasing 
pH. Additional experiments in which the 
culture medium was supplemented with 
acetic acid showed that the various effects 
of pH were due to an inhibitory effect of 
the byproduct acetic acid on cell growth. 
The strength of the acid inhibition was 
seen to depend only on the concentration 
of its undissociated form. 

The pH-dependence of conversion yield 
of 2,3-BD may be quite different from 
the temperature-dependence. Perego et al. 
[157], in their study with E. aerogenes, re- 
ported that while this parameter was kept 
almost constant within a narrow range 
of pH (5.5 < pH < 6.5), it was sharply decreased
either at lower or at higher pH
values, with the stronger effect being de- 
tected under acid conditions. On this basis, 
these authors claimed that the metabolic 
pathway leading to 2,3-BD formation is 
more sensitive to pH variations than to 
temperature variations. 


4.3.3 Temperature 

The course of 2,3-BD fermentation is 
recognized as a temperature-dependent 
process. It has also been proved that 
the conditions for maximum product 
formation must approximate those for 
maximum biomass yield, as it is gen- 
erally accepted that the production of 
butanediol is a growth-associated phe- 
nomenon. In general, the optimum tem- 
perature for the synthesis of 2,3-BD for 
most bacterial strains is in the range of 
30-35 °C [138, 171, 184, 191]. However, 
as different strains may possess different 
temperature optima, the optimal value
should be established individually for each 
strain. 

Ghosh and Swaminathan [192] applied 
an aqueous two-phase (ATP) extractive fer- 
mentation to the production of 2,3-BD 
by K. oxytoca. Optimization of the pro- 
cess parameters (including temperature) 
for the ATP fermentation involved statisti- 
cally designed experiments that employed 
response surface methodology (RSM) and 
provided further information regarding in- 
teractions between the parameters. Hence, 
the optimum process conditions for the 
enhanced production of 2,3-BD were established
as: temperature 31.5 °C, pH 6.7,
and agitation (shaking speed) 172 rpm. 
A K. oxytoca strain was also used by 
Anvari and Motlagh [193] for 2,3-BD pro- 
duction under submerged culture, when 
the optimal process parameters were de- 
termined using the Taguchi method at 
three levels for temperature (37 °C), inoculum
size (8 g l⁻¹), pH (6.1) and shaking
speed (150 rpm). The optimal combination
of factors obtained from the proposed
design-of-experiments methodology was
further validated by conducting fermentation
experiments, whereby the results
obtained showed an enhanced 2,3-BD yield
of 44%. Cho et al. [194] reported that a
newly isolated bacterium, K. oxytoca M1, 
produced 2,3-BD or acetoin selectively as 
a major product depending on the temperature.
In detail, 2,3-BD was synthesized
preferentially (with a yield of 0.32-0.34 g/g glucose)
at 30 °C, while acetoin was the major
product (0.32-0.38 g/g glucose) at 37 °C.
The expression level of acetoin reductase 
(AR), which catalyzes the conversion of 
acetoin to 2,3-BD, was also investigated 
and found to be 12.8-fold lower at 37°C 
than at 30°C. 

Biebl et al. [58] noted that lowering the 
temperature from 35 to 30°C in cultures 
of K. pneumoniae resulted in a significant
reduction in ethanol synthesis such that 
2,3-BD formation predominated. In addi- 
tion, an uncontrolled acidification starting 
at pH >7 and temperature <30°C al- 
lowed a maximum 2,3-BD production to 
be achieved. Yu et al. [195], in their study, 
also used a strain of K. pneumoniae to synthesize
2,3-BD, and set the fermentation
temperature at 30°C. 

The effect of temperature on 2,3-BD pro- 
duction from crop-biomass cassava pow- 
der by Enterobacter cloacae was assessed 
by Wang et al. [196], who showed 30°C 
to be the most favorable temperature for 
the diol production, cell growth, and glu- 
cose utilization. Under optimal conditions, 
78.3 g l⁻¹ of 2,3-BD was produced after
24 h, whereas at a suboptimal temperature 
the rate of metabolism was decreased. The 
optimum temperature for 2,3-BD produc- 
tion by E. cloacae was also found to be 
30°C in a study conducted by Saha and 
Bothast [170]. 

2,3-BD production by B. licheniformis 
NCIMB 8059 was studied by Perego
et al. [157], who monitored temperature
(34 °C < T < 40 °C), inoculum size
(0.5 < X₀ < 10 g l⁻¹), and starting substrate
concentration (20 < S₀ < 70 g l⁻¹)
as factors affecting product formation.
In effect, the highest butanediol yield
(0.87 mol mol⁻¹) was obtained at
T = 37 °C, pH 6.0, X₀ = 10 g l⁻¹, and
S₀ = 30 g l⁻¹ with a cornstarch hydrolysate
as substrate. In contrast, for P. polymyxa 
(fed-batch or batch fermentation with 
agitation) the temperature was fixed at 
30°C [197]. 


4.3.4 Medium Supplementation with
Acetic Acid 

Acetic acid, as a byproduct of 2,3-BD 
fermentation, can also induce the activ- 
ity of the three enzymes involved in the 


conversion of pyruvate to 2,3-BD [198]. 
Bryn et al. [199] and Størmer [200] each
claimed that an enhanced efficiency of 
pyruvate to 2,3-BD conversion could be 
provided by ionized acetate, as this would 
induce ALD formation and regulate the 
balance between acetoin and 2,3-BD. Yu 
and Saddler [201] noted that, in a fermentation
process with K. pneumoniae, acetic
acid added to a wood hydrolysate at a concentration
<1.0% (166.5 mM) enhanced
2,3-BD production by two- to three-fold.
However, this effect varied with culture
pH; at pH 5.5, <1.0 g l⁻¹ acetic acid inhibited
product formation, whereas at pH 6.7
a 10-fold higher concentration was re- 
quired to create the same effect. It was 
subsequently claimed that this inhibition 
was caused by a high proportion of the 
acetic acid being undissociated at low pH, 
as undissociated organic acids are gener- 
ally more toxic towards bacteria than their 
dissociated forms [28]. 
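
The pH dependence described above follows from the Henderson-Hasselbalch relation. As a rough sketch, assuming a pKa of 4.76 for acetic acid (a standard value at 25 °C that is not given in the text) and ignoring ionic strength and temperature effects, the undissociated fraction at the two pH values can be compared as follows; the function name is hypothetical.

```python
ACETIC_ACID_PKA = 4.76  # assumed literature value at 25 degrees C


def undissociated_fraction(ph, pka=ACETIC_ACID_PKA):
    """Fraction of a monoprotic acid present as HA: 1 / (1 + 10**(pH - pKa))."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


if __name__ == "__main__":
    low_ph = undissociated_fraction(5.5)
    high_ph = undissociated_fraction(6.7)
    print(f"pH 5.5: {low_ph:.3f} of the acid is undissociated")
    print(f"pH 6.7: {high_ph:.3f} of the acid is undissociated")
    print(f"ratio (pH 5.5 / pH 6.7): {low_ph / high_ph:.1f}")
```

Under these assumptions roughly 15% of the acid is undissociated at pH 5.5 against about 1% at pH 6.7, a ratio of between 13 and 14, which is of the same order as the 10-fold difference in inhibitory concentration noted above.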


4.4
Recovery of Biologically Produced 2,3-BD 


The recovery of 2,3-BD from fermentation 
broths is the primary economic barrier 
in the industrial production of biobased 
2,3-BD [8, 202]. The major difficulties as- 
sociated with 2,3-BD separation are its 
high boiling point (183-184 °C), high hy- 
drophilicity, and the presence of dissolved 
and solid constituents in the fermenta- 
tion mashes [8, 203]. Initially, separation 
techniques such as steam stripping [204], 
pervaporation [205], distillation [206] or re- 
verse osmosis [207] were established for 
the downstream processing of 2,3-BD fer- 
mentation, but the high energy demands 
of these processes caused them to be 
economically unfeasible [208, 209]. Con- 
sequently, alternative and cost-effective 


methods for 2,3-BD recovery are much 
sought after. 

During the past decade, most studies 
on the downstream processing of 2,3-BD 
have focused on different techniques for 
extraction rather than separation. Com- 
pared to above-mentioned separation tech- 
niques, extraction demonstrates several 
clear advantages such as a large through- 
put and low energy consumption [210]. 
Liquid—liquid extraction has been used 
to recover 2,3-BD in several studies [137, 
203, 211, 212], although the solvents used 
(including 1-butanol, isobutanol, or oleyl 
alcohol) did not allow high partition coeffi- 
cient values and consequently the yield of 
hydrophilic 2,3-BD obtained from the fer- 
mentation broth did not exceed 75% [137, 
213]. In addition, these methods require 
large amounts of solution [202, 208]. Re- 
active extraction involves the transforma- 
tion of 2,3-BD to a hydrophobic molecule 
without hydroxyl groups, after which it 
is recovered by solvent extraction [202, 
203, 210, 214]. When Li et al. [208] sep- 
arated 2,3-BD from a fermentation broth 
by reactive extraction, using acetaldehyde 
as reactant and cyclohexane as extractant, 
the total yield of 2,3-BD was >90% and 
the mass fraction of 2,3-BD in the final 
product reached 99%. In 2013, Li et al. 
[202, 214] described two different meth- 
ods for the reactive extraction of 2,3-BD, 
using n-butylaldehyde as both the reactant
and the extractant. For both
methods, the recovery of 2,3-BD was >90%
and the purity of the final product reached 
99%. However, any organic salts formed 
during the fermentation would reduce the 
efficiency of the recovery process [215]. 
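
The dependence of recovery on the partition coefficient can be made concrete with a simple equilibrium mass balance. The sketch below is an idealized single-solute model with hypothetical names and illustrative numbers (none are taken from the cited studies): K is the ratio of the 2,3-BD concentration in the solvent phase to that in the aqueous phase at equilibrium, and R is the solvent-to-broth volume ratio.

```python
def single_stage_recovery(k, volume_ratio):
    """Fraction extracted in one equilibrium stage: K*R / (1 + K*R)."""
    if k < 0 or volume_ratio < 0:
        raise ValueError("k and volume_ratio must be non-negative")
    extraction_factor = k * volume_ratio
    return extraction_factor / (1.0 + extraction_factor)


def crosscurrent_recovery(k, volume_ratio, stages):
    """Fraction extracted after repeated contacts, each with fresh solvent."""
    if stages < 1:
        raise ValueError("stages must be at least 1")
    # Each stage leaves behind the fraction 1 / (1 + K*R) of what entered it.
    remaining = (1.0 - single_stage_recovery(k, volume_ratio)) ** stages
    return 1.0 - remaining


if __name__ == "__main__":
    for k in (0.5, 1.0, 3.0):  # illustrative partition coefficients
        one = single_stage_recovery(k, 1.0)
        two = crosscurrent_recovery(k, 1.0, 2)
        print(f"K = {k:.1f}: one stage {one:.2f}, two stages {two:.2f}")
```

With K = 1, for example, one contact with an equal volume of solvent recovers only half of the product, and reaching 75% takes either a second contact or a threefold higher K·R, which illustrates why low partition coefficients translate into large solvent requirements.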

An alternative approach to 2,3-BD recovery
is referred to as “salting-out extraction”
(SOE; also known as aqueous two-phase 
extraction; ATPE). This method allows the 
extraction of a hydrophilic target from 
aqueous solution using an organic solvent
as the extractant and salt as the salting-out 
reagent [162, 163, 210, 213, 216]. Jiang et al. 
[216] recovered 2,3-BD from fermentation 
broths with a 98.13% yield by using an 
ethanol/dipotassium hydrogen phosphate 
system, whereas Sun et al. [162] used an SOE
system that consisted of 2-propanol and 
ammonium sulfate and recovered 2,3-BD 
with a yield of 93.7%. Subsequently, Li et al. 
[163] investigated the ethanol/ammonium 
sulfate system for 2,3-BD separation, 
and achieved a recovery for 2,3-BD of 
91.7%. Dai et al. [213] described a sim- 
ple method for the separation of 2,3-BD 
directly from a lignocellulose-derived fer- 
mentation broth. In this case, the SOE 
employed a K₂HPO₄/ethanol system, and
a recovery of 2,3-BD was achieved with 
99% yield. Recently, Matsumoto et al. [210] 
reported the downstream processing of 
2,3-BD using an SOE system formed from 
water-miscible organic solvents and inor- 
ganic salts. Among the tested variants, a 
system which utilized tetrahydrofuran as 
the water-miscible solvent and potassium 
carbonate as the inorganic salt showed the 
highest recovery of 2,3-BD, at 92.2%. Al- 
though the main advantages of the SOE 
method are its low consumption of energy 
and solution, large amounts of salts are 
required for an efficient completion of the 
process [163]. 

In conclusion, further developments of 
methods that allow the recovery of 2,3-BD 
from fermentation broths are required in 
order to render the fermentative produc- 
tion of 2,3-BD cost-effective. 


4.5 
Genetic Engineering of 2,3-BD 


Although many microorganism species 
are able to produce 2,3-BD naturally, only a 
few can be considered as having potential 
for mass production purposes. One such
species, K. pneumoniae, has been studied extensively
and used industrially to produce
2,3-BD. In the central carbon metabolism 
of K. pneumoniae, the 2,3-BD synthesis
pathway is dominated by three essential 
enzymes, namely ALD, ALS, and butane- 
diol dehydrogenase, which are encoded 
by the budA, budB, and budC genes, re- 
spectively. The mechanisms of the three 
enzymes have been characterized with re- 
gards to their function and roles in 2,3-BD 
synthesis and cell growth [217]. However, 
as noted above, the pathogenic characteristics
of K. pneumoniae are considered
to be obstacles that hinder the industrial 
application of this microorganism. 

Jung et al. [218] removed the viru- 
lence factors from three 2,3-BD-producing 
strains, K. pneumoniae KCTC 2242, K. oxytoca
KCTC 1686, and K. oxytoca ATCC
43863 by employing a site-specific recom- 
bination technique. It has been reported 
that the pathogenicity of Klebsiella species 
is attributable to several genes, among 
which the wabG gene was key due to 
its role in the synthesis of one domain 
of the core lipopolysaccharide (LPS) [219]. 
The technique employed involved gener- 
ating a deletion mutation in the wabG 
gene encoding glucosyltransferase, which 
plays a key role in the synthesis of outer 
core LPSs, by attaching the first outer core 
residue D-GalAp to the O-3 position of
the L,D-HeppII residue. When the morphologies
and adhesion properties against
epithelial cells were investigated, the re- 
sults indicated that wabG mutant strains 
were devoid of the outer core LPS and had 
lost their ability to retain a capsular struc- 
ture. Although growth was not affected by 
disrupting the wabG gene, the production 
of 2,3-BD was decreased from 31.27 to 
22.44 g l⁻¹ in the mutant compared to that
of the parental strain [218]. 


Another contribution to the metabolic 
engineering of 2,3-BD synthesis was made 
by Mingshou et al. [142], who utilized 
the wabG-deleted K. pneumoniae wild-type
strain to construct seven new mutants. 
Few studies have been focused on the 
cooperative mechanisms of the three en- 
zymes (budA, budB, and budC) which 
play key roles in 2,3-BD synthesis path- 
way, and their mutual interactions. Therefore,
the K. pneumoniae KCTC2242 ΔwabG
wild-type strain was utilized to reconstruct 
seven new mutants by single, double, and 
triple overexpression of the three enzymes 
key to this study. A metabolic flux analy- 
sis showed that the seven overexpressed 
mutants all exhibited enhanced 2,3-BD 
production, and the double overexpression 
strain budBA+ produced the highest yield 
(ca. 3.4 g l⁻¹) [142].

An interesting strategy for the develop- 
ment of 2,3-BD-producing microbes that 
hold GRAS status was presented by Gaspar
et al. [220], who engineered an L. lactis
strain. For this, it was assumed that
the manipulation of NADH-dependent 
steps, and particularly disruption of the 
las-located ldh gene in L. lactis, is common
to engineering strategies envisaging
an accumulation of reduced end-products 
other than lactate. Given that genetic 
redundancy is often a major cause of 
metabolic instability in engineered strains, 
the aim was to develop a genetically stable 
lactococcal host tuned for the produc- 
tion of reduced compounds. Therefore, 
the ldhB and ldhX genes were sequentially
deleted in L. lactis FI10089, a strain
with a deletion of the ldh gene. The single,
double, and triple mutants (FI10089,
FI10089ldhB, and FI10089ldhBldhX, respectively)
showed similar growth profiles
and displayed mixed-acid fermentation,
with ethanol being the main reduced
end product. Hence, the alcohol


dehydrogenase-encoding gene (adhE) was 
inactivated in FI10089, but the resulting 
strain reverted to homolactic fermentation 
due to an induction of the ldhB gene.
The three LDH-deficient mutants were 
selected as a background for the produc- 
tion of mannitol and 2,3-BD. Pathways 
for the biosynthesis of these compounds 
were overexpressed under the control of 
a nisin promoter, and the constructs ana- 
lyzed with respect to growth parameters 
and product yields under anaerobiosis. 
Glucose was efficiently channeled to man- 
nitol (maximal yield, 42%) or to 2,3-BD 
(maximal yield, 67%) [220]. 

Keywords


Synthetic biology 
The designing and building of engineered biological systems that process information 
and produce biochemicals, food, and energy. 


Biofuels 
Alternative transportation fuels derived from renewable biological resources. 


Lignocellulosic biomass 
Abundant agricultural and forestry residues composed of mainly lignin, cellulose, and 
hemicellulose. 


Metabolic engineering 
Directed genetic modifications for the improvement and/or design of cells for increase 
of yield, productivity, and cell robustness. 


Feedstock engineering 
Genetic modification of plant-derived feedstock for the easy extraction of fermentable 
sugars. 


Microbial engineering 
Genetic modification of microbes for the production of high yields of biofuels and 
chemicals. 


Hydrolytic enzymes 
Enzymes such as cellulase, xylanase, and β-glucosidase, responsible for the hydrolysis
of plant biomass. 


Algal fuels 
Biofuels derived from lipid or other components of algae. 


Synthetic Biology in Biofuels Production 


An increasing demand for biofuels is inevitable today, considering the adverse impact
of fossil fuels on the environment, questions over their sustainability, rising prices,
and dependence on foreign countries. However, the question remains of how to produce
biofuels in a cost-effective manner from non-food resources. Non-food agricultural and
forestry residues consist of recalcitrant biomass that is difficult to hydrolyze
enzymatically, and the presence of non-conventional pentose sugars and inhibitors makes
the fermentation of the sugars into ethanol a formidable task. Besides, ethanol has the
inherent drawbacks of a low energy density and a hygroscopic nature, encouraging
scientists to look for alternative fuels, such as butanol and hydrocarbons. Algae are
another non-food feedstock being explored for fuel production, but their low growth rate
and low lipid yield under fluctuating environmental growth conditions are of great
concern. Synthetic biology, with its new tools and applications, is likely to play a
central role in addressing these issues.


1 
Introduction 


Demand for biofuels has been increasing 
during the past two decades, mainly due 
to the deleterious impact of fossil fuels 
on the environment. Currently, approximately 
30 billion metric tons of carbon 
dioxide are emitted annually worldwide, 
of which three-fourths result from the 
use of fossil fuels [1]. According to the 
Intergovernmental Panel on Climate Change 
(IPCC) assessment, this, along with other 
greenhouse gases released into the 
environment, has led to global warming of 
1.1-1.6 °F over the past century [2]. 
Besides such impact on the environment, 
the sustainability of fossil fuels has al- 
ways been questioned and humankind’s 
dependence on a limited number of fossil 
fuel-producing foreign countries to ful- 
fil its energy needs has become of grave 
global concern. 

The negative issues related to fossil fuel 
use are, however, likely to be mitigated by 
renewable energy, due to its sustainability 
and environment-friendly characteristics. 
Currently, biofuels represent the 
alternative transportation fuels and are 
mostly derived from plant sources. 
The first-generation biofuels, namely 
bioethanol and biodiesel, were produced 
from food materials such as corn starch, 
cane sugar and vegetable oils, and this 
has led to a food versus fuel controversy, 
forcing both scientists and policy-makers 
to seek alternative feedstocks that do not 
interfere with the human food chain. The 
most common feedstocks currently being 
evaluated are agricultural and forestry 
residues for ethanol production, and 
algal lipids for biodiesel production. The 
process of extracting fermentable sugars 
from agricultural and forestry residues 
and converting them into ethanol has, 
however, been found to be a herculean 
task due to the recalcitrant nature of the 
biomass and the presence of pentose 
sugars that cannot be fermented by yeasts 
that are used traditionally in the beverage 
industry [3]. Likewise, the slow rate of 
algal growth and lipid production, in 
association with the logistics of growing 
these organisms in open ponds and 
their subsequent harvesting, have created 
bottlenecks to the commercial success of algal 
biofuels [4]. Ethanol, in particular, has 
its own limitations in terms of its low 
energy density, its hygroscopic nature 
and engine-corrosive properties, and the 
need for it to be blended with fossil fuels 
in order to be used in normal engines. 
This has led to a quest for fuel molecules 
beyond ethanol, with alternatives to date 
including butanol and other short-chain 
alcohols, alkanes/alkenes, and other 
hydrocarbons. 

Synthetic biology is likely to play a central 
role in ensuring that a successful 
creation of biofuels from nonfood materials 
is achieved (Fig. 1). Such a role 
may start from the feedstock itself, where 
the recalcitrant nature of the biomass and 
its cellulose content could be improved, 
and extend to engineering efficient enzymes 
to hydrolyze the biomass and/or engineering 
microbes capable of utilizing all 
components of the biomass and converting 
them into advanced fuels and other 
useful products. The economics of the 
process will play a major role in deciding 
whether biofuels from nonfood material 
will become a commercial reality, and syn- 
thetic biology may play an important role 
in most steps of this process. 

Synthetic biology is, basically, an expan- 
sion of biotechnology with the ultimate 
goals of being able to design and build 
engineered biological systems that can 
process information, produce biochemi- 
cals and energy, provide food, and main- 
tain and enhance human health and the 
environment [3, 5, 6]. Synthetic biology 
advances the capabilities for engineering 
biological systems by employing engineer- 
ing principles and novel biological tools. 
Indeed, the engineering of biological sys- 
tems has enormous potential to reshape 
the world in a variety of areas, includ- 
ing the sustainability of all systems capa- 
ble of manufacture at both macro- and 
micro-levels, as well as addressing various 
issues in health and general medicine. 
Some of these applications, and the tools 
of synthetic biology, are discussed in the 
following sections. 


2 
Synthetic Biology: The Tools and 
Applications 


2.1 
Fabrication of Genes 


The reconstruction of completely or par- 
tially synthetic metabolic pathways can be 
achieved via the stable assembly and trans- 
formation of heterologous DNA segments 
in a selected host system, such as Sac- 
charomyces cerevisiae [7-10]. The different 
methods used to fabricate genes for the 
reconstruction of pathways can be catego- 
rized into three groups: (i) gene fabrication 
based on digestion and ligation; (ii) gene 
fabrication based on in-vitro homologous 
recombination; and (iii) gene fabrication 
based on in-vivo homologous recombination. 
The first of these methods was used 
successfully to reconstruct the bacterial 
methyl erythritol-4-phosphate (MEP) path- 
way in S. cerevisiae by incorporating seven 
enzymatic steps of the pathway onto a 
yeast episomal plasmid (YEP) [11]. Several 
pathways have also been reconstructed in 
S. cerevisiae by utilizing in-vitro recombina- 
tion through yeast artificial chromosomes 
(YACs) [12], and the reconstruction of 
the flavonoid pathway in S. cerevisiae us- 
ing YAC is an important example of this 
[13]. The integration of “foreign” DNA 
into the yeast chromosomes by utilizing 
an in-vivo homologous recombination sys- 
tem of yeast offers the best means of 
reconstructing the pathway in synthetic 
biology. Yeast integrative plasmids (YIPs) 
have the potential to integrate into the 
yeast chromosome, and can be used to 
perform genetic manipulations in yeasts 
[14-17]. Another important tool which 
is employed in S. cerevisiae and based 
on homologous recombination, is that 
of transformation-associated recombina- 
tion (TAR), which involves a homologous 
recombination during yeast spheroplast 
transformation between genomic DNA 
and a TAR vector containing foreign DNA 
[18]. This provides a flexible and efficient 
method that allows the transfer of up to 
250 kb of selective DNA. 


2.2 
Biological Circuits 


Synthetic biological circuits represent the 
application of synthetic biology for design- 
ing a biological system in such a way that 
electronic networks are mimicked to per- 
form logical biological functions inside the 
cell. Biological circuits are composed of re- 
pressor or promoter elements to facilitate 
either the creation of a product or the in- 
hibition of a competing pathway. These 
circuits can be used to modify cellular 
functions, to create cellular responses to 
environmental conditions, or to influence 
cellular development by implementing ra- 
tional, controllable logic elements in cellu- 
lar systems. Their applications range from 
simply inducing production to adding a 
measurable element, such as green flu- 
orescent protein (GFP), to an existing 
natural biological circuit to implementing 
completely new systems of many parts. 
The creation of a “bistable” switch in Es- 
cherichia coli is an important example of a 
synthetic biological circuit [19], where the 
switch is turned on by heating the bacte- 
rial culture and turned off by the addition 
of isopropyl β-D-1-thiogalactopyranoside 
(IPTG), using GFP as a reporter for the system. 
Similarly, several biological circuits 
have been devised that help to modulate 
biological systems for certain objectives, and 
also help to provide a better understanding 
of the biological system. 
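
The bistability described above can be illustrated with a minimal sketch. It assumes the widely used dimensionless two-repressor model, in which each repressor inhibits synthesis of the other; the parameter values (alpha, n), the step size, and the function name are hypothetical choices for illustration and are not taken from the cited work. The inducers (heat, IPTG) are not modeled explicitly; different starting concentrations stand in for them.

```python
def simulate_toggle(u0, v0, alpha=10.0, n=2.0, dt=0.01, steps=5000):
    """Euler integration of an assumed two-repressor toggle model.

    du/dt = alpha / (1 + v**n) - u
    dv/dt = alpha / (1 + u**n) - v
    u and v are dimensionless repressor concentrations.
    """
    u, v = u0, v0
    for _ in range(steps):
        du = alpha / (1.0 + v ** n) - u  # synthesis repressed by v, linear decay
        dv = alpha / (1.0 + u ** n) - v  # synthesis repressed by u, linear decay
        u += dt * du
        v += dt * dv
    return u, v


# Two different starting points settle into two different stable states.
state_a = simulate_toggle(u0=5.0, v0=0.1)  # repressor u ends up dominant
state_b = simulate_toggle(u0=0.1, v0=5.0)  # repressor v ends up dominant
print(state_a, state_b)
```

With these assumed parameters the same equations support two stable states, which is the property that lets a transient signal flip the switch and have the new state persist.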


2.3 
Metabolic Engineering 


Metabolic engineering, another important 
application of synthetic biology, is the prac- 
tice of optimizing genetic and regulatory 
processes within cells to increase the pro- 
duction of certain substances by those 
cells. In metabolic engineering, instead 
of directly deleting and/or overexpress- 
ing the genes that encode for metabolic 
enzymes, the current focus is to target 
the regulatory networks in a cell so as to 
efficiently engineer the metabolism [20]. 
According to the Biotechnology Indus- 
try Organization, more than 50 biore- 
fineries are currently being built across 
North America to test and refine technolo- 
gies for the production of biofuels and 
chemicals from renewable biomass, all of 
which will help to reduce “greenhouse 
gas” emissions [21]. Potential biofuels in- 
clude those produced from metabolically 
engineered microbes, such as short-chain 
alcohols and alkanes to replace gasoline, 
fatty acid methyl esters (FAMEs), fatty al- 
cohols, fatty acids, and isoprenoid-based 
biofuels to replace diesel oil [22]. To date, 
innovations in metabolic engineering have 
offered a variety of new abilities to mi- 
croorganisms which are not inherent to 
them. Examples include the production 
of artemisinic acid in engineered yeast 
[9], the production of n-butanol in S. 
cerevisiae [23], enhancements in the pro- 
duction of fatty acid-derived biofuels by 
using a dynamic sensor regulator sys- 
tem in E. coli [24], and the modulation 
of metabolic flux using synthetic protein 
scaffolds [25]. 


2.4 
Predicting Fabricated Genes or Pathways 
Behavior in a Biological System 


Before the addition of any heterologous 
gene or pathway to a biological system, 
it is always advantageous to predict 
how this will affect the physiology 
or behavior of the system. Notably, 
this will help to avoid wasted effort, 
and to save huge amounts of resources 
and precious time in metabolic 
engineering. Synthetic biologists obtain 
this information from metabolic network 
modeling, which allows a better predic- 
tion of system behavior prior to its fab- 
rication. Such modeling will also help 
the synthetic biologist to better under- 
stand how biological molecules bind sub- 
strates and catalyze reactions, to real- 
ize how DNA encodes the information, 
and to note how multicomponent inte- 
grated systems behave. The construction 
of a metabolic network model and its 
analysis are discussed in the following 
section. 


2.4.1 Metabolic Network Modeling 

The fundamental requirement for 
metabolic network modeling (which is 
also referred to as metabolic reconstruction 
and simulation) is a huge amount of 
information pertaining to the physiology, 
biochemistry, and genetics of the 
target organism. In other words, a 
reconstruction will collect all of the 
relevant metabolic information for 
an organism and compile it into a 
mathematical model. The subsequent 
validation and analysis of a reconstruction 
can allow the identification of key 
features of metabolism, such as growth 
yield, resource distribution, network 
robustness, and gene essentiality. This 
approach allows an in-depth insight into 
the molecular mechanisms of a particular 
organism, and can be applied to create 
a novel biotechnology or a predictive 
capacity. A network can also be visualized 
as a graph in which biological entities 
such as genes, transcripts, proteins and 
metabolites correspond to nodes, while 
the interactions between nodes, such 
as coexpression and protein-protein 
interactions, correspond to edges. 
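
The graph view of a network described above can be sketched in a few lines. This is only an illustration: the node names and the edge list are hypothetical placeholders, and a real reconstruction would draw them from curated databases.

```python
from collections import defaultdict

# Hypothetical edges: (node_a, node_b, interaction_type).
edges = [
    ("geneA", "geneB", "coexpression"),
    ("proteinA", "proteinB", "protein-protein"),
    ("proteinA", "proteinC", "protein-protein"),
]


def build_graph(edge_list):
    """Return an undirected adjacency map: node -> set of (neighbor, type)."""
    graph = defaultdict(set)
    for a, b, kind in edge_list:
        graph[a].add((b, kind))
        graph[b].add((a, kind))
    return graph


graph = build_graph(edges)
# Degree (number of interaction partners) is a simple first look at
# which nodes are most connected in the network.
degree = {node: len(neighbors) for node, neighbors in graph.items()}
print(degree)
```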

Genomic information gathered as a 
result of the advent of next-generation 
sequencing has revolutionized metabolic 
reconstruction, especially genome-scale 
metabolic reconstructions. This involved 
the integration of biochemical metabolic 
pathways with rapidly available, unanno- 
tated genome sequences. Several bioin- 
formatics tools, such as Model SEED, 
ERGO, Pathway Tools, and MetaMerge, 
are currently available to help in automat- 
ing the reconstruction process. These tools 
collect basic information from different 
databases, including Kyoto Encyclopedia 
of Genes and Genomes (KEGG), BioCyc, 
EcoCyc, MetaCyc, ENZYME, BRENDA, 
BiGG, and metaTIGER. As the genomes 
of increasing numbers of organisms have 
become completely sequenced, in-silico 
genome-scale metabolic models have been 
constructed for several organisms in the 
domains of bacteria, archaea and eukarya, 
so that they can be used to explore their 
metabolic characteristics at systems level 
[26-30]. 


2.4.2 Analysis of the Model 

Fundamental knowledge of the intra- 
cellular distribution of carbon fluxes 
and their regulation in the metabolism 
has played an important role in under- 
standing cellular physiology and predict- 
ing its metabolic capability under speci- 
fied environmental or genetic conditions 
[31-34]. Currently, two methods based 
on optimization techniques are used to 
analyze the metabolic network model, 
namely constraints-based flux analysis and 
13C-based flux analysis. 


Constraints-Based Flux Analysis 
Constraints-based flux analysis is used 
to analyze cellular metabolism under 
a specified environmental or genetic 
condition, and to predict metabolic 
capability when the specified conditions 
are perturbed [35]. The constraints-based 
flux analysis consists of two parts: (i) 
objective functions; and (ii) constraints to 
metabolic fluxes in the metabolic model 
[36, 37]. Constraints are the conditions 
that must be satisfied while solving for the 
optimal solution to the metabolic network 
by maximizing/minimizing the objective 
function(s). This method has been used 
successfully to improve the metabolic 
capability for the overproduction of 
industrially important products, including 
biofuels [30, 32, 35], and also for gene 
knockout investigations. 
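
The two parts named above, an objective function and constraints on the fluxes, are commonly written as a linear program. The formulation below is the standard textbook form rather than one given in this chapter, and the symbols are assumptions introduced here: S is the stoichiometric matrix, v the vector of reaction fluxes, c the weights defining the objective (for example, growth or product formation), and v_min and v_max the lower and upper flux bounds.

```latex
\begin{aligned}
\max_{v}\quad & c^{\top} v \\
\text{subject to}\quad & S\,v = 0, \\
& v_{\min} \le v \le v_{\max}
\end{aligned}
```

The equality constraint expresses the steady-state assumption that internal metabolites do not accumulate; a gene knockout can be represented by setting both bounds of the affected reaction to zero.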


13C-Based Flux Analysis 
In 13C-based flux analysis, 13C-labeled substrates are 
used to grow cultures and to measure 
the distribution of isotope-labeled carbon 
throughout the metabolic network. The 
techniques generally used to measure the 
13C-enrichment patterns of metabolites in- 
clude nuclear magnetic resonance (NMR) 
and gas chromatography-mass spectrometry 
(GC-MS) [33, 34]. The intracellular 
fluxes are then estimated by fitting itera- 
tively the simulated fluxes in stoichiomet- 
ric models to the measured data [33, 38]. 
This method has also been used to discover 
and quantify the in-vivo operation of un- 
usual pathways within complex metabolic 
networks, and to elucidate the pathways in 
less-characterized species [33, 39-41]. 
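
The iterative fitting step can be sketched with a deliberately simplified, hypothetical case: two pathways feed the same metabolites, each pathway would produce a known labeled fraction on its own, and the unknown is the fraction f of flux through the first pathway. The labeling patterns and the "measured" values below are synthetic numbers generated for illustration, not experimental data, and real analyses use full stoichiometric and isotopomer models.

```python
def simulate(f, patterns):
    """Simulated labeled fraction of each metabolite for flux split f.

    patterns: list of (fraction_if_pathway_1, fraction_if_pathway_2).
    """
    return [f * a + (1.0 - f) * b for a, b in patterns]


def sse(f, patterns, measured):
    """Sum of squared errors between simulated and measured labeling."""
    return sum((s - m) ** 2 for s, m in zip(simulate(f, patterns), measured))


def fit_split(patterns, measured, iterations=60):
    """Iteratively narrow [0, 1] by ternary search on the convex error."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if sse(m1, patterns, measured) < sse(m2, patterns, measured):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.0


# Hypothetical labeling patterns and synthetic "measurements" made with f = 0.7.
patterns = [(1.0, 0.0), (0.5, 0.1), (0.8, 0.3)]
true_f = 0.7
measured = simulate(true_f, patterns)
estimated_f = fit_split(patterns, measured)
print(estimated_f)
```

Because the error is quadratic in f, a simple interval search suffices here; multi-flux problems generally need a numerical optimizer.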


3 
Role of Synthetic Biology in Feedstock 
Improvement 


As noted in Section 1, lignocellulosic 
biomass obtained as the unusable por- 
tion of a plant biomass in the form of 
agricultural, industrial, domestic and for- 
est residues, as well as from dedicated 
energy crops such as poplar, switch grass, 
Miscanthus sp. and others that can be 
grown on marginal lands, has the po- 
tential to become an alternative source 
for fuels because of its greater availability 
at the global level [42, 43]. Yet, unfortu- 
nately these feedstocks often cannot be 
converted efficiently into biofuels and so 
are often wasted. The major limitation 
here is based on the high production costs 
associated with a need for large quantities 
of expensive hydrolyzing enzymes, as well 
as the pretreatment processes required to 
achieve an efficient hydrolysis of the lig- 
nocellulosic biomass. Consequently, syn- 
thetic biologists have recently expended 
much effort in making fuel production 
from lignocellulosic biomass economically 
feasible (see below). 


3.1 
Feedstock Engineering for Efficient 
Hydrolysis 


The lignocellulosic biomass consists basi- 
cally of cellulose, hemicelluloses, lignin, 
and a small amount of pectin. Cellulose 
and hemicelluloses, as the key polymers, 
are hydrolyzed to create monomers that 
eventually undergo fermentation to pro- 
duce biofuels. The efficiency of the hy- 
drolysis, which is catalyzed by several 
enzymes, depends on enzyme specificity 
(e.g., specific activity, pH optima, ther- 
mostability) and substrate accessibility. 
The cell wall component responsible for 
restricting accessibility to the hydrolytic 
enzyme is lignin [44] which, due to its 
heterogeneous nature, cannot be adequately 
or completely degraded by a single 
enzyme [45]. Consequently, by reducing the 
proportion of lignin in the cell wall and 
modulating its structure such that it is 
more easily degraded, accessibility to the 
enzyme can be improved and the processing 
costs reduced (Fig. 2). 

Fig. 2. Feedstock engineering for cost-competitive fermentable sugar for biofuel production. 


3.1.1 Feedstock Engineering for Low 
Lignin Content 

Efforts to engineer plants that produce 
less lignin have mainly been focused on 
modulating the key enzymes involved in 
the biosynthetic pathways of the monolig- 
nols and, eventually, lignin biosynthesis. 
Chen and Dixon [44] downregulated six 
genes involved in the lignin biosynthetic 
pathways by using antisense approaches 
in transgenic alfalfa. Subsequently, by 
comparing saccharification efficiencies, 
the suppression of genes involved in the 
early stages was found to be most effective 
in reducing the lignin content. Indeed, 
in some cases the lignin content was 
reduced by half compared to that in wild-type 
plants. The extent of cellulase-mediated 
digestion of the untreated stems from two 
classes of transgenic alfalfa was comparable 
to that observed for the digestion of 
lignin-free, microcrystalline cellulose. 

In another study, the downregulation 
of hydroxycinnamoyl-CoA:NADPH 
oxidoreductase (CCR) in poplar (Populus) 
was found to result in a more digestible 
cellulose by Clostridium cellulolyticum and 
produced twice the amount of fermentable 
sugar [46]. The downregulation of hy- 
droxycinnamate CoA/5-hydroxyferuloyl 
CoA-ligase or 4CL in transgenic quaking 
aspen (Populus tremuloides) resulted in a 
45% decrease in lignin content, with a 
concomitant 15% increase in cellulose 
[47]. It is believed that such compensation 
occurred because quantitative or qualita- 
tive changes in one cell-wall component 
can often result in changes to other 
cell-wall components [48]. 
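
As a small worked example of what these two figures imply together, the sketch below computes the change in the cellulose-to-lignin ratio. It assumes both percentages are relative changes against the wild type, measured on the same basis; that reading is an assumption made here, not something stated in the text.

```python
def ratio_fold_change(cellulose_change, lignin_change):
    """Fold change in the cellulose:lignin ratio.

    Both arguments are relative changes versus wild type
    (e.g. +0.15 for a 15% increase, -0.45 for a 45% decrease).
    """
    return (1.0 + cellulose_change) / (1.0 + lignin_change)


fold = ratio_fold_change(cellulose_change=0.15, lignin_change=-0.45)
print(round(fold, 2))  # 1.15 / 0.55
```

Under that assumption, the cellulose-to-lignin ratio roughly doubles relative to the wild type.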

Decreases in lignin content have also 
been reported following the downregulation 
of phenylalanine ammonia-lyase (PAL), which 
is the master enzyme responsible for the 
downstream regulation of the whole lignin 
biosynthesis flux. The decrease in lignin 
content was seen to depend on the degree 
of PAL suppression [49, 50]. 

Lignin, as an important component of 
plants, provides the plant with mechanical 
strength as well as an ability to defend 
against invading pathogens. Transgenic 
plants with a greatly reduced lignin con- 
tent were far shorter, such that the overall 
biomass production was much reduced 
[51, 52], confirming that lignin 
engineering must be conducted with utmost 
caution. 


3.1.2 Change in Lignin Structure 

Lignin is a heteropolymer formed 
mainly from the monolignols coniferyl 
and sinapyl alcohol and, to a lesser 
extent, from p-coumaryl alcohol; this 
results in guaiacyl (G), syringyl (S) 
and p-hydroxyphenyl (H) units being 
incorporated into the lignin polymer, 
respectively. Each of these residues 
results from separate, but interconnected, 
biosynthetic pathways, the manipulation 
of which is expected to modify the plant’s 
lignin. On this basis, lignin biosynthesis 
pathways have been engineered to 
alter the proportions of the different 
monomers, the aim being to make 
the molecule more amenable to digestion. 
A change in lignin structure has also been 
observed as a result of the downregulation 
of 4-hydroxycinnamate 3-hydroxylase (C3H). 
Predictably, this led to an increase 
in the proportion of p-hydroxyphenyl 
units relative to the normally dominant 
G and S units [53]. In another study, the 
downregulation of cinnamyl alcohol 
dehydrogenase in poplar (Populus) 
caused an increase in less-conventional 
syringyl units and β-O-4 bonds, and a 
greater number of free phenolic groups 
[54]. 

In nature, lignin contains ether bonds 
that are difficult to degrade, and genetic 
engineering has been applied to introduce 
ester bonds, which are more easily broken 
down chemically, into the lignin backbone. 
A gene isolated from 
the Dong Quai (Angelica sinensis) plant and 
integrated into the poplar tree genome 
produced an exotic monomer that became 
incorporated into the lignin chain. The 
resultant ester linkage was shown to be 
highly amenable to degradation [55]. 


3.2 
Feedstock Engineering for Increased 
Biomass 


Another important approach for feedstock 
engineering is focused on enhancing the 
polysaccharide production that, eventu- 
ally, leads to an increase in the overall crop 
biomass (Fig. 3). An increased biomass 
will increase the availability of raw ma- 
terials and, ultimately, decrease the cost 
of biofuel production. The recent progress 
made in this field is described in the fol- 
lowing sections. 


3.2.1 Delay in Reproductive Phase 
A floral repressor gene, Flowering Locus 
C (flc), that causes a prolonged vegetative 
growth stage has been identified in 
Arabidopsis [56]. It was observed that plants 
overexpressing the flc gene prolonged their 
vegetative growth phase unless they were 
exposed to vernalization [56, 57]. Nuclear 
transformation of this gene in tobacco 
delayed flowering by three weeks and 
significantly increased transgenic plant biomass 
at the greenhouse level [58]. 


3.2.2 Genetic Manipulation of Plant 
Growth Regulators 

The regulation of certain growth regula- 
tors, such as brassinosteroids, has been 
reported to increase plant biomass without 
the need for increased fertilizer 
applications [59]. In another study, increased 
gibberellin biosynthesis in a transgenic 
hybrid poplar promoted plant growth and 
biomass [60]. 


3.2.3 Modulation of Nutrient Metabolism 
The chloroplastic fructose-1,6-bisphos- 
phatase (FBPase) is known to have a key 
role in CO2 assimilation, and in 
coordinating carbon and nitrogen metabolism 
to increase sucrose production. When the 
pea FBPase gene was downregulated in 
transgenic Arabidopsis, the lower levels 
of FBPase production resulted in an 
increased production of sucrose [61]. 


3.2.4 Facilitation of Phosphorus 
Utilization 

An increase in biomass through increasing 
the availability of key nutrients has 
also been reported. Phosphorus is one 
of the least available nutrients in soil, 
yet has an important role in photosyn- 
thesis, respiration, and the regulation of 
many enzymes. Expression of the Med- 
icago truncatula gene for purple acid 
phosphatase (MtPAP1) in transgenic Ara- 
bidopsis resulted in a twofold increase 
in biomass production when 2 mM 
phytate was supplied as the sole source of 
phosphorus in soil [62]. This effect was 
most likely due to the auxiliary role of 
MtPAP1 in the utilization of the ex- 
ogenous phytate, which increases the 
availability of phosphorus to transgenic 
plants. 


3.3 
Feedstock Engineering for Heterologous 
Expression of Hydrolyzing Enzyme 


The heterologous expression of 
biomass-hydrolyzing enzymes in plants 
would, in theory, abolish the need for a 
hydrolyzing enzyme in the downstream 
step for the extraction of monomeric 
sugars. However, the harsh conditions 
applied to the feedstock during the 
thermochemical pretreatment process 
decreased the catalytic efficiency of 
the enzymes expressed in the plant 
[63]. A biologically active, thermostable 
endo-1,4-β-endoglucanase (E1) enzyme 
of Acidothermus cellulolyticus has been 
successfully expressed in rice and corn 
[64]. Moreover, production of 
the enzyme caused no apparent harm 
to the plants’ normal growth and 
development [65]. In the transgenic 
corn and rice, E1, with the addition 
of β-glucosidase (Novozyme 188, St. 
Louis, MO, USA), successfully converted 
30% of corn stover and rice straw into 
glucose. 


4 
Enzyme Engineering for Hydrolysis of 
Biomass 


A low specific activity, poor thermostabil- 
ity, low pH optima, and high production 
costs are important limitations of the en- 
zymes used in the degradation of lignocel- 
lulosic biomass (Fig. 4). Recent progress 
and the challenges in enzyme engineering 
to address these concerns are discussed in 
the following sections. 


4.1 
Enzymes with Higher Catalytic or Specific 
Activity 


In efforts to improve the yield and rate 
of the enzymatic hydrolysis, investigations 
have been focused on optimizing the hy- 
drolysis process and enhancing cellulase 
activity [66, 67]. A high-throughput selec- 
tion method to improve endoglucanase 
activity was developed by adapting chemi- 
cal complementation to provide a growth 
assay for bond cleavage reactions. This 
method helped in the identification of two 
variants with improved catalytic efficiency 
(3.7- and 5.7-fold) from a DNA shuffling 
library with a size of 10° [68]. 


4.2 
Enzymes with Improved Thermostability 
and pH Optima 


Fifteen highly diverse thermostable cel- 
lobiohydrolase hybrids (up to 7 °C higher 
stability than the most thermostable 
parent) have been constructed by 
screening 73 variants, using a SCHEMA 
structure-guided recombination method 
[69]. This group of diverse cellobiohy- 
drolases provides a better platform for 
improving their catalytic efficiency. It has 
also been reported that an endoglucanase 
expressed in chloroplasts had a higher 
temperature stability and a wider pH 
optimum range than when the enzyme was 
expressed in E. coli [70]. 

Fig. 4. Enzyme engineering for cost-competitive lignocellulosic biofuel production. 


4.3 
Enzymes with Bifunctional or 
Multifunctional Activities 


A bifunctional enzyme can increase 
catalytic efficiency by broadening 
the substrate range. A bifunctional 
chimeric protein, designed and 
constructed by fusing the 
β-1,4-endoglucanase (Endo5A) and 
β-1,4-endoxylanase (Xyl11D) enzymes of 
Paenibacillus sp. strain MTCC 5639, 
has been reported [71]. Four chimeric 
protein models were generated by fusing 
the enzyme-encoding genes either 
end-to-end or through a glycine-serine 
(GS) linker. The three-dimensional (3D) 
structures of the four models were 
predicted using the I-TASSER (Iterative 
Threading ASSEmbly Refinement) 
server, and their secondary structures 
analyzed using circular dichroism (CD) 
spectroscopy. The chimeric model 
Endo5A-GS-Xyl11D, in which a linker 
separated the two enzymes, yielded the 
highest C-score and exhibited secondary 
structure properties closest to the 
native enzymes. This chimeric enzyme 
demonstrated 1.6- and 2.3-fold higher 
enzyme activities for endoglucanase and 
xylanase, respectively, than the individual 
Endo5A and Xyl11D. 

In another study, six bifunctional 
chimeric proteins were constructed 
based on the fusion of Endo5A and 
Gluc1C enzymes of the same strain of 
Paenibacillus sp. by varying the positions 
of the enzymes and the size of the linkers 
[72]. One of these constructs, EG5, which 
consisted of Endo5A-(G4S)3-Gluc1C, 
demonstrated 3.2- and 2-fold higher molar 
specific activities for β-glucosidase and 
endoglucanase, respectively, than Gluc1C 
and Endo5A alone. 


5 
Microbial Engineering to Expand the 
Substrate Range 


One of the major points which makes 
the fermentation of biomass-derived sugar 
difficult when compared to corn starch 
or cane sugars is the diversity of the 
sugars. Lignocellulosic biomass is com- 
posed of polymers of C5 (pentose) and 
C6 (hexose) sugars such as glucose, xy- 
lose, arabinose, and galactose. For agricul- 
tural residues such as rice straw, wheat 
straw, corn-cobs and sugar cane bagasse, 
the pentose sugar is a major compo- 
nent (30-40%) of the total sugars [73]. 
The traditional ethanol-fermenting yeast 
S. cerevisiae cannot ferment pentose sug- 
ars, and consequently several attempts 
have been made to engineer ethanologenic 
hosts (especially S. cerevisiae) to expand 
their substrate-utilizing capability. Two 
xylose-assimilating pathways from heterol- 
ogous sources have been introduced into 
S. cerevisiae. In the first pathway, xylose 
isomerase (XI/XylA) from various bacte- 
ria and anaerobic fungi were expressed 
in S. cerevisiae to convert xylose into xy- 
lulose. Xylulose was further converted to 
xylulose-5P, an intermediate of the pentose 
phosphate pathway (PPP), by overexpress- 
ing endogenous or heterologous xylulose 
kinase (XKS1/XYL3) [74]. In the second 
pathway, xylose is converted to xylulose 
via a xylitol intermediate with the help of 
two heterologous enzymes, xylose reduc- 
tase (XR/XYL1) and xylitol dehydrogenase 
(XDH/XYL2). The XR/XDH pathway has 
two intrinsic issues compared to the XI 
pathway: a cofactor imbalance, since 
XR uses NADPH while XDH uses NAD+ 
as cofactor, and the secretion of the 
intermediate xylitol. Extensive efforts have been 
made to optimize the XR/XDH pathway as 
it has a thermodynamic advantage over the 
XI pathway, leading to faster xylose 
utilization and ethanol production [75]. The 
major focus in optimizing the pathway 
has been to identify mutant XR and XDH 
that can bind one type of cofactor, either 
NADH/NAD+ specific or NADPH/NADP+ 
specific, and to balance the relative 
activities of XR, XDH, and XKS1 so that 
the intermediate xylitol is not secreted 
from the cells [76-79]. Endogenous 
pathways, mainly involving the PPP, have been 
perturbed to improve the xylose assim- 
ilation rate [79-83]. With the advent of 
next-generation genome sequencing tech- 
niques, an inverse metabolic engineering 
technique has been adopted where S. cere- 
visiae harboring either the XR/XDH/XK or 
XI/XK pathway has been adapted for faster 
growth on xylose, while genes that were 
mutated were analyzed via genome se- 
quencing and used for further engineering 
the microbes [74, 84]. A further challenge 
faced by the xylose-fermenting yeast was 
the sequential utilization of glucose and 
xylose. Typically, the glucose fermenta- 
tion rate is ~3- to 10-fold higher than 
the xylose fermentation rate, and xylose 
uptake is repressed in the presence of 
glucose. Although several strategies were 
adopted to address this issue, the most 
prominent one involved using cellobiose 
(a dimer of glucose and the product 
of cellobiohydrolase action on cellulose) 
along with xylose instead of glucose. 
A xylose-fermenting S. cerevisiae was 
used to express β-glucosidase, an enzyme 
that hydrolyzes cellobiose to glucose, 
either on the cell’s surface or intracellularly 
along with a cellobiose transporter [85-87]. 
In either case, cellobiose and xylose were 
utilized simultaneously and ethanol was 
produced at high yield. Besides S. cerevisiae, 
another ethanol-producing microbe, 
Zymomonas mobilis, has also been engi- 
neered for widening its substrate range to 
utilize xylose and mannose along with glu- 
cose. Here, mannose-utilizing (manA) and 
xylose-utilizing (xylA, xylB, tal, and tktA) 
genes from E. coli were introduced into Z. 
mobilis, after which the engineered strain 
was shown to produce ethanol from acid 
hydrolysate continuously for 10 days [88]. 
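
The cofactor imbalance of the XR/XDH pathway discussed earlier in this section can be made concrete with simple bookkeeping. The sketch below assumes the commonly cited one-to-one stoichiometry for each step (XR: xylose + NADPH to xylitol + NADP+; XDH: xylitol + NAD+ to xylulose + NADH), ignores protons and water, and uses dictionary and function names that are purely illustrative.

```python
from collections import Counter

# Stoichiometry per reaction: negative = consumed, positive = produced.
# Protons and water are omitted for simplicity (assumption).
XR = {"xylose": -1, "NADPH": -1, "xylitol": +1, "NADP+": +1}
XDH = {"xylitol": -1, "NAD+": -1, "xylulose": +1, "NADH": +1}


def net_balance(*reactions):
    """Sum reaction stoichiometries and drop species that cancel out."""
    total = Counter()
    for reaction in reactions:
        for species, coeff in reaction.items():
            total[species] += coeff
    return {species: coeff for species, coeff in total.items() if coeff != 0}


net = net_balance(XR, XDH)
print(net)
```

Xylitol cancels, but each xylose converted consumes one NADPH and one NAD+ while producing one NADP+ and one NADH, so the cell must regenerate both cofactors by other routes. This is the imbalance that cofactor-specific XR and XDH mutants are meant to remove.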
Another abundant feedstock that has 
been explored for biofuel production is 
brown algae (seaweeds), which does not 
require arable land, fertilizers or fresh wa- 
ter resources for its growth, and therefore 
avoids any adverse impact on food supplies 
[89]. The most abundant sugars in brown 
algae are alginate, mannitol, and glucan. 
Alginate is a linear block copolymer of 
B-D-mannuronate (M) and a-1-guluronate 
(G). A 36-kb DNA fragment from Vibrio 
splendidus encoding enzymes for alginate 
transport and metabolism was character- 
ized and integrated into the E. coli genome. 
Further engineering of E. coli with the 
ethanol pathway enabled the engineered 
platform to produce ethanol directly from 
macroalgae via a consolidated process 
[89]. The further discovery of an alginate
monomer (4-deoxy-L-erythro-5-hexoseulose
uronate; DEHU) transporter from the
alginolytic eukaryote Asteromyces cruciatus
led to the engineering of the industrial
microbe S. cerevisiae for the production of
ethanol from mannitol and DEHU [90].


6 
Microbial Engineering to Produce Various 
Biofuel Molecules 


Microbial engineering is an important
segment of the overall process that will
determine how carbon sequestered by the
photosynthetic organism in the form of
sugars can be converted to various biofuel
molecules. The possible biofuel molecules
available, together with the metabolic
pathways involved, are summarized in
Fig. 5 and discussed in the following
sections.


6.1 
Lignocellulosic Ethanol 


Ethanol is a major fermentation-derived 
biofuel that is available commercially to- 
day and is mainly produced by the robust 
S. cerevisiae [91]. A problem arises, how- 
ever, when lignocellulosic biomass is to 
be used as the carbon source because 
such biomass generally contains differ- 
ent types of sugars, along with various 
inhibitory compounds that are generated 
during the pretreatment process. Hence, 
most metabolic engineering efforts in the 
field of ethanologenic fermentations are 
focused either on increasing the tolerance 
to these inhibitors or enabling coutiliza- 
tion of the sugars present. 

A dilute acid pretreatment of the lig- 
nocellulosic biomass in the presence of 
steam not only depolymerizes the biomass 
to release sugars but also produces a 
range of growth-inhibiting molecules
such as furfural, 5-hydroxymethylfurfural, 
formate, acetate, and soluble lignin [92]. 
Although most organisms contain en- 
zymes that can reduce furfural and 
5-hydroxymethylfurfural to their corre- 
sponding alcohols, their growth is im- 
paired significantly until these compounds 
are completely metabolized. Moreover, as
the Km for NADPH of these enzymes
is generally very low, a serious NADPH
deficit is created during the process of
furfural detoxification. Furfural tolerance
has been improved in several ways, including
the deletion of yqhD [93], overexpression of the
transhydrogenase pntAB [94], overexpression
of an NADH-dependent propanediol
oxidoreductase fucO [95], and overexpression
of the cryptic gene ucpA [96]. In
another study conducted by the same group,
all of these traits were incorporated into 
the E. coli genome, which resulted in an 
increased tolerance for furfural and an 
improved hydrolysate utilization [97]. 


6.2 
Butanol and Higher-Chain Alcohols 


Today, metabolic engineering strategies 
are increasingly employed to produce 
higher-chain alcohols, as these are not 
efficiently created by their native produc- 
ers. Unfortunately, these efforts have not 
resulted in industrially relevant titers or 
productivities [98], mainly because of a 
lack of techniques for genetic modifica- 
tion, slow growth rates, and the complex 
physiology of the native organisms [99]. 
Consequently, many recent synthetic biol- 
ogy efforts have been focused on E. coli and 
S. cerevisiae, primarily because they have 
very well established tools for genetic ma- 
nipulation and also because of their rapid 
growth. 

n-Butanol, a C4 alcohol with a higher en- 
ergy density and lower hygroscopicity than 
ethanol, was one of the first biofuel candi- 
dates to be produced in a non-native host. 
n-Butanol is produced naturally in Clostrid- 
ium acetobutylicum by the fermentation of 
sugars and starch [100], although due to 
the low titer and gradual loss of butanol 
production in continuous fermentations 
this route of production was never applied 
commercially. Basically, two molecules of 
acetyl CoA are condensed to form ace- 
toacetyl CoA, which is then reduced in a 
series of steps to butyryl CoA and finally, 
to butanol. This pathway was reproduced
in S. cerevisiae, where a butanol titer of
only 2.5 mg l⁻¹ was observed [23]. Butanol
was also produced recombinantly in E. coli
with an initial titer of 13.9 mg l⁻¹, which
was further increased to 552 mg l⁻¹ on the
deletion of genes responsible for the
production of the major side products (i.e.,
lactate, succinate, ethanol, and acetate),
and deletion of the fnr gene, which
inactivates the pyruvate dehydrogenase complex
under anaerobic conditions [101]. As the
enoyl CoA reduction step was indicated 
to be a bottleneck in the pathway, it was 
felt that the presence of an irreversible en- 
zyme catalyzing this reaction would drive 
the pathway in the forward direction [102]. 
Accordingly, the native enoyl CoA reduc- 
tase of C. acetobutylicum, that is, butyryl 
CoA dehydrogenase (bcd) was replaced by 
the irreversible trans-enoyl CoA reductase 
(ter) from Treponema denticola. In addi- 
tion, two of the enzymes in the butanol 
production pathway (hbd and crt) were re- 
placed by enzymes producing stereospe- 
cific end products (phaB and phaJ) from 
Ralstonia eutropha and Aeromonas caviae, 
respectively. These efforts, coupled with
overexpression of the pyruvate dehydrogenase
complex aceEF-lpd (for increased
production of both acetyl CoA and NADH)
resulted in a dramatic increase in the
butanol titer to 4.65 g l⁻¹ [102]. In an
independent study, the atoB and ter genes
(from E. coli and T. denticola, respectively)
were recombinantly expressed in an E. coli
ΔldhA ΔfrdA ΔadhE strain, along with
the other three clostridial enzymes of the
butanol production pathway, namely hbd,
crt, and adhE2 [103]. As the host strain
has an NADH surplus, it is unable to 
grow in anaerobic conditions unless it 
has an active NADH-consuming pathway, 
such as the butanol production route. The 
recombinant strain was able to produce
1.8 g l⁻¹ butanol in 24 h. Subsequently, formate
dehydrogenase from Candida boidinii was
overexpressed so as to reduce pyruvate
accumulation in the cell, as well as to
generate excess NADH in the process. The
endogenous phosphotransacetylase (pta)
gene was also knocked out so that acetate
production could be reduced and
cellular acetyl CoA levels could be
increased. The final strain was able
to produce 15 g l⁻¹ butanol in three days,
achieving an impressive 88% of the
theoretical yield [103]; the titer was increased
further to 30 g l⁻¹ by fermentation
optimization and gas stripping.
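
To put the "88% of the theoretical yield" figure and the NADH argument above into numbers, the following sketch works through the arithmetic. It is an illustration rather than a calculation from the cited work: it assumes the overall stoichiometry of one butanol per glucose (C6H12O6 -> C4H10O + 2 CO2 + H2O), standard atomic masses, and textbook NADH counts for each step; all variable names are chosen here for illustration.

```python
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999}  # g/mol


def molar_mass(formula):
    """Molar mass (g/mol) from an element-count dict, e.g. {"C": 4, "H": 10, "O": 1}."""
    return sum(ATOMIC_MASS[element] * count for element, count in formula.items())


GLUCOSE = {"C": 6, "H": 12, "O": 6}
BUTANOL = {"C": 4, "H": 10, "O": 1}

# Assumed overall stoichiometry: 1 mol butanol per mol glucose.
max_yield = molar_mass(BUTANOL) / molar_mass(GLUCOSE)  # g butanol per g glucose
reported_yield = 0.88 * max_yield                      # 88% of theoretical [103]

# NADH bookkeeping per glucose (textbook values, assumed here).
nadh_produced = {
    "glycolysis": 2,
    "pyruvate to acetyl CoA (PDH, or PFL plus formate dehydrogenase)": 2,
}
nadh_consumed = {
    "hbd": 1,
    "ter": 1,
    "adhE2 (two reduction steps)": 2,
}
balance = sum(nadh_produced.values()) - sum(nadh_consumed.values())

print(f"theoretical yield: {max_yield:.3f} g/g")
print(f"88% of theoretical: {reported_yield:.3f} g/g")
print(f"net NADH per glucose: {balance}")
```

With these assumptions the pathway consumes exactly the NADH generated upstream, which is consistent with the text's point that butanol formation can serve as the NADH sink in a strain lacking the native fermentative routes.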

Butanol production was also monitored 
in other organisms with a higher butanol 
tolerance than E. coli, such as Bacillus sub- 
tilis and Pseudomonas putida [104]. The 
recombinant P. putida strain could produce
50 mg l⁻¹ butanol using glucose as
a carbon source, and 122 mg l⁻¹ using
glycerol [105], whereas the recombinant
B. subtilis with gene deletions of amyE,
thrC, and pyrD produced 23 mg l⁻¹ butanol
[105]. Butanol production in a photosynthetic
organism was also attempted, such
that the need for an external addition of
any carbon source was completely
eliminated. Genomic integration of the atoB,
hbd, crt, ter, and adhE2 genes at two loci
resulted in a butanol titer of 3.04 mg l⁻¹
in seven days, while a rearrangement of
the same genes increased the final titer to
13.6 mg l⁻¹ [106].

Instead of creating a completely new 
pathway in bacteria or yeast by recombi- 
nant expression, efforts were also made 
to produce butanol by modifying the fatty 
acid metabolism. The expression of genes 
of the E. coli fatty acid beta-oxidation 
cycle (i.e., atoB, fadA, fadB, fadE, and 
adhE) was altered by integrating a highly 
active promoter and ribosome-binding 
site upstream of them. This strain was
able to produce 897 mg l⁻¹ butanol using
glucose under semi-aerobic conditions
[107]. Subsequently, it was shown that the
beta-oxidation cycle could be essentially
reversed on constitutive expression of the
fad-ato regulon in the presence of a fatty
acid carbon source [108]. The same group
also improved the strain efficiency by
mutating the crp gene and knocking out the
arcA gene; both of these changes relieved
repression of the
beta-oxidation pathway. Overexpression of
the endogenous fucO (L-1,2-propanediol
oxidoreductase) and yqeF (acyl
transferase), along with knockout of the genes
responsible for side product formation
(i.e., adhE, yqhD, and eutE, which are
responsible for ethanol production, and pta and frdA)
led to a butanol titer of 2.2 g l⁻¹, which was
increased to 14 g l⁻¹ by optimization of the
fermentation parameters [109].

The butanol formation route was also 
established in E. coli via the amino acid 
biosynthesis pathway. In this pathway, 
2-keto acids are converted to the corre- 
sponding aldehydes by keto acid decar- 
boxylases, and then reduced to alcohol. 
When the decarboxylases from five differ- 
ent organisms were studied, kivD from 
Lactococcus lactis was found to be the best 
[110]. The overexpression of alsS (acetolactate
synthase from B. subtilis), ilvC,
ilvD (endogenous genes of the isoleucine
biosynthetic pathway), kivD and adh2
(alcohol dehydrogenase from S. cerevisiae) in
an E. coli ΔadhE ΔldhA ΔfrdA Δpta Δfnr
ΔpflB strain led to an isobutanol titer of
22 g l⁻¹, accompanied by 667 mg l⁻¹ of
butanol and 541 mg l⁻¹ of propanol, under
microaerobic conditions [110]. Microaerobic
conditions are difficult to reproduce
in large-scale cultivations, and completely
anaerobic growth is generally preferred.
However, amino acid synthesis generally
occurs in the presence of oxygen as two
of the enzymes require NADPH as the
cofactor; this is regenerated only via the
PPP and the tricarboxylic acid (TCA) cycle
(both of which are aerobic pathways) [111].
When the cofactor dependence of two
enzymes (ilvC and adhA) was altered from
NADPH to NADH, using a directed
evolution approach [112, 113], the resultant
strain was capable of producing 13.4 g l⁻¹
isobutanol under completely anaerobic
conditions in minimal media [112]. In
another strategy, the enzyme citramalate 
synthase (CimA) from the thermophile 
Methanococcus jannaschii, which provides a
more direct route for 2-ketobutyrate syn- 
thesis, was used [114]. Basically, CimA 
catalyzes the acetylation of pyruvate to 
(R)-citramalate, which is then converted 
to 2-ketobutyrate by leuBCD. CimA was 
modified by both random mutagenesis 
and DNA shuffling techniques so as to 
improve its affinity for pyruvate. An E. coli 
ΔilvA ΔtdcB strain overexpressing kivd, adh2,
and the improved CimA was able to produce
2.8 g l⁻¹ propanol and 393 mg l⁻¹ butanol
under anaerobic conditions in minimal 
media [114]. 

Isobutanol formation was tested in the
cyanobacterium S. elongatus PCC 7942 by
genomic integration of the alsS, ilvC, ilvD,
and kivD genes, and the engineered strain
was able to produce 723 mg l⁻¹
isobutyraldehyde in 12 days [115]. No isobutanol
production was observed, most likely due
to the low activity of the endogenous
alcohol dehydrogenase. Rubisco from another
cyanobacterium, S. elongatus PCC 6301,
was then inserted downstream of the
endogenous rbcLS genes, and an alcohol
dehydrogenase from E. coli (yqhD) was also
expressed. This
led to isobutyraldehyde and isobutanol
titers of 1.1 g l⁻¹ and 450 mg l⁻¹,
respectively, in eight days [115]. Isobutanol
production in Corynebacterium glutamicum
was also attempted, due to its rapid growth
rate, ease of genetic manipulation and
highly active amino acid biosynthetic pathway
(because of which it is used for the
commercial production of various amino
acids). Briefly, alsS, ilvC, ilvD, kivD, and
the endogenous alcohol dehydrogenase
gene adhA were overexpressed to produce
2.6 g l⁻¹ isobutanol in 48 h [116]. Subsequently,
the pyruvate carboxylase gene pyc
was knocked out so as to prevent the
conversion of pyruvate to oxaloacetate and
enable maximal pyruvate flux for amino
acid biosynthesis. Lactate dehydrogenase
was also deleted so as to prevent the
accumulation of lactic acid and simultaneously
build up the cellular concentration of
pyruvate. The recombinant double knockout
strain produced 4.9 g l⁻¹ isobutanol
in 120 h [116]. B. subtilis was also tested
for isobutanol production because of its
higher tolerance for the end product as
compared to E. coli or C. glutamicum.
The alsS, ilvC, ilvD, kivD, and endogenous
alcohol dehydrogenase adh2 genes were
expressed in a two-plasmid system, and
a final isobutanol titer of 2.62 g l⁻¹ with a
productivity of 0.086 g l⁻¹ h⁻¹ was obtained
after fermentation optimization [117].
Overexpression of the valine biosynthetic
genes ILV2, ILV3, and ILV5, along
with the branched-chain amino acid
aminotransferase BAT2 in S. cerevisiae, led to an
isobutanol yield of 4.12 mg g⁻¹ glucose in
complex media under aerobic conditions
[118]. Keto-acid decarboxylases and alco- 
hol dehydrogenases from different sources 
were overexpressed and their effects on 
isobutanol titer compared. A combination 
of kivd (L. lactis) and ADH6 (S. cerevisiae) 
was found to be the best in terms of yield 
[119]. In addition, ILV2, which catalyzes the
committed step of valine biosynthesis,
was overexpressed and an isoenzyme of
pyruvate decarboxylase (pdc1) was knocked
out (to prevent acetaldehyde formation).
An isobutanol titer of 75 mg l⁻¹ in 72 h
was observed, and this was increased to
143 mg l⁻¹ in 120 h under microaerobic
conditions [119].
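
Because the studies above report titers over very different fermentation times, it can help to convert them to an average volumetric productivity (titer divided by elapsed time). The sketch below does this for the isobutanol figures quoted in the text. It assumes productivity is averaged over the whole reported period, which the source does not state explicitly, and the function and variable names are illustrative.

```python
def volumetric_productivity(titer_g_per_l, hours):
    """Average volumetric productivity in g/l/h over the whole fermentation."""
    return titer_g_per_l / hours


# (label, titer in g/l, time in h), taken from the figures quoted in the text.
reports = [
    ("C. glutamicum, base strain [116]", 2.6, 48),
    ("C. glutamicum, double knockout [116]", 4.9, 120),
    ("S. cerevisiae [119]", 0.075, 72),
    ("S. cerevisiae, microaerobic [119]", 0.143, 120),
]

for label, titer, hours in reports:
    rate_mg = volumetric_productivity(titer, hours) * 1000  # mg/l/h
    print(f"{label}: {rate_mg:.1f} mg/l/h")

# For B. subtilis, the text gives titer and productivity, so the implied
# fermentation time can be back-calculated instead.
implied_hours = 2.62 / 0.086
print(f"B. subtilis implied time: {implied_hours:.1f} h")
```

On these averaged figures, the C. glutamicum double knockout reaches a higher final titer but a lower average rate than the base strain, which is a reminder that titer and productivity are separate performance measures.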

3-Methylbutanol (3MB), another potential
biofuel molecule with an energy density
of 30.5 MJ l⁻¹, was also produced
recombinantly in E. coli. The genes of the
valine biosynthetic pathway were overexpressed
along with the kivd and adh2 genes
to produce 56 mg l⁻¹ 3MB under anaerobic
conditions [120]. The production was
further improved by overexpressing alsS
(from B. subtilis) and the leuABCD genes,
while knocking out genes participating in
competing pathways; consequently, a final
3MB titer of 1.28 g l⁻¹, along with 0.2 g l⁻¹
isobutanol, was obtained after 28 h under
anaerobic conditions [121]. This strain,
when subjected to many rounds of random
mutagenesis and selection using the
leucine analog 4-aza-D,L-leucine, produced
4.4 g l⁻¹ 3MB in 36 h, which was increased
to 9.6 g l⁻¹ when a two-phase fermentation
strategy was employed [121].

The butanol production pathway 
described by Shen et al. [103] was slightly 
modified so as to produce 1-hexanol. 
Basically, a β-ketothiolase (BktB) from
Ralstonia eutropha that can condense
an acetyl CoA moiety with butyryl CoA
to form 3-ketohexanoyl CoA was used
[122]. An E. coli ΔldhA ΔfrdA ΔadhE
strain was used as the host strain, but
no hexanol was detected. However, on
coexpression of the ter gene from both
T. denticola and Euglena gracilis, 23 mg l⁻¹
hexanol was detected, which was further
increased to 47 mg l⁻¹ on knocking out
the pta gene and overexpressing
the formate dehydrogenase gene fdh
from C. boidinii [122]. In another study,
a structure-based protein engineering
approach was adopted to change the
substrate specificity of leuA and kivD
so as to produce higher-chain alcohols
[123]. Site-directed mutations were created
in the keto-acid binding site of both
enzymes so that they could effectively
interact with longer keto-acid substrates.
The product profile of the recombinant
strain varied depending on the amino
acid mutation incorporated. For example,
a G642D and an S139G mutation together
enabled the production of 793.5 mg l⁻¹
3-methyl-1-pentanol [123], while other
mutations led to different products
such as 4-methyl-1-hexanol (51.9 mg l⁻¹) and
5-methyl-1-heptanol (22.5 mg l⁻¹). Further
modification of the leuA gene, based 
on quantum mechanical modeling and 
protein—substrate complex modeling, 
coupled with overexpression of the 
threonine biosynthetic genes and deletion 
of the threonine transporter, resulted in 
the production of butanol, 1-pentanol, 
1-hexanol, 1-heptanol, and 1-octanol in 
minimal media, although the titer of the 
products decreased with increasing chain 
length [124]. 


6.3 
Fatty Acid-Based Biofuels 


The other major class of biofuels, which
includes fatty acid methyl or ethyl esters,
alka(e)nes, and fatty alcohols, is derived via
the fatty acid synthesis pathway. The major
advantage of these compounds is that
they have similar properties to biodiesel.
To date, most of the metabolic engineer- 
ing effort to produce such molecules has 
been attempted in E. coli. In general, bac- 
teria do not produce fatty acids in excess, 
and consequently genetic manipulations 
were required to boost the cellular con- 
centrations of fatty acids, which serve as 
the precursors of these biofuel candidates. 
Almost all such efforts have involved an
overexpression of the thioesterase enzyme,
which cleaves the growing fatty acyl-acyl
carrier protein (ACP) chain from the fatty
acid synthase complex, resulting in free
fatty acid accumulation. The nature of the
fatty acids produced can also be altered
in terms of chain length and degree of
unsaturation, depending on the source of
the enzyme [125-128]. Overexpression of
acetyl CoA carboxylase (ACC), which catalyzes
the first and also the rate-limiting step of the fatty
acid biosynthesis pathway, also led to an
increase in free fatty acids [129]. A combined
strategy of overexpression of ACC
(from E. coli) and a thioesterase from
Umbellularia californica, and deletion of the
beta-oxidation pathway, enabled a sevenfold
increase in fatty acid titer as compared
to the control, and also a predominance
of C12 and C14 fatty acids [130]. In another
study, thioesterases from both E.
coli and Cinnamomum camphorum were
overexpressed in E. coli along with the
native ACC, while the beta-oxidation pathway
was inhibited due to knockout of the
fadD gene [125]. The recombinant strain
produced 2.5 g l⁻¹ fatty acids in a fed-batch
cultivation.

Biodiesel is composed of FAMEs or fatty 
acid ethyl esters (FAEEs), and is tradition- 
ally produced by the trans-esterification 
of vegetable oils. The microbial produc- 
tion of esters was achieved by a simul- 
taneous production of fatty acids and 
an alcohol molecule, preferably ethanol 
due to its low toxicity as compared to 
methanol. FAEE synthesis in E. coli was 
first achieved by the overexpression of pdc 
and alcohol dehydrogenase (adhB) genes 
from Zymomonas mobilis (for ethanol production)
and an acyl transferase ADP1
from Acinetobacter baylyi. The recombinant
strain produced 1.3 g l⁻¹ FAEE from
exogenously added oleic acid [131]. In another
study, the endogenous thioesterase
was overexpressed along with ADP1 and
fadD, and 400 mg l⁻¹ FAEEs was produced


Synthetic Biology in Biofuels Production 


on the exogenous addition of ethanol 
[132]. In the same study, overexpression 
of the pdc and adhB genes helped to 
completely avoid any ethanol addition.
Increasing the endogenous free fatty acid
pool by overexpression of ACC led to a
total FAEE titer of 922 mg l⁻¹ after
optimization of the fermentation
parameters [133]. A recombinant production of
FAME in E. coli was also achieved by the
overexpression of a fatty acyl methyltransferase
from Mycobacterium marinum that
uses S-adenosylmethionine as the methyl
group donor [134].

Wax esters are a closely
related category of biofuel molecules, and
although they are produced naturally by
plants in small quantities, their microbial
production is currently under investigation.
Recombinant expression of the acyl
CoA reductase from the jojoba plant, along
with the ADP1 gene from A. baylyi, led to the
esterification of fatty alcohols and fatty
acyl CoA thioesters to form palmitoyl
oleate, palmitoyl palmitoleate, and similar
molecules [135]. Wax formation was
also shown by an overexpression of acyl
CoA reductase (from Mus musculus), tesA,
fadD, and ADP1 in an E. coli ΔfadE strain
[132].

Triacylglycerols (TAGs) are the most
common category of neutral lipids produced
in bacteria of certain genera such as
Rhodococcus, Mycobacterium, and Streptomyces.
An E. coli ΔdgkA strain that accumulates
high concentrations of diacylglycerol due
to deletion of the diacylglycerol kinase A
gene produced TAG when a diacylglycerol
acyltransferase (DGAT) from Streptomyces
coelicolor was expressed recombinantly
[136]. In another approach, the endogenous
phosphatidic acid was dephosphorylated
so as to accumulate diacylglycerol,
which was then esterified with fatty acyl CoA
to produce TAG. An E. coli strain
overexpressing ADP1 and phosphatidate
phosphatase (pgpB) produced 1.1 mg l⁻¹
TAG [137]. TAG production was again observed
in the E. coli ΔdgkA strain expressing
both DGAT and phosphatase from S.
coelicolor [138].

Methyl ketones are aliphatic molecules
that are commonly used in flavors and
fragrances, but are also potential biofuel
candidates due to their high cetane number.
These compounds are produced naturally
in plants and animals, but not in most
bacteria. Expression of the methyl ketone
synthase genes I and II (i.e., shkmsI and
shkmsII) from the tomato plant (Solanum
habrochaites) in E. coli led to the production
of methyl ketones [139]. Overexpression of
these genes in a strain lacking the
fermentative pathways (i.e., E. coli ΔadhE ΔldhA
Δpta ΔpoxB) resulted in a high titer of
450 mg l⁻¹ [140]. Methyl ketone synthesis
was also achieved via another strategy
involving the overexpression of fadB,
fadA, and Mlut1700 (an acyl CoA oxidase
from Micrococcus luteus) in a strain having
an impaired fatty acid beta-oxidation
pathway (i.e., E. coli ΔfadE ΔfadA) [141],
and 380 mg l⁻¹ methyl ketones were produced
by this strain. In a similar approach,
fadB, fadA, and Mlut1700, and the endogenous
thioesterase, were recombinantly expressed
in a Ralstonia eutropha strain in
which both putative beta-oxidation operons
had been knocked out. This strain produced
180 mg l⁻¹ methyl ketones under
chemolithoautotrophic conditions, using
only CO₂ and H₂ as carbon and electron
sources, respectively [142].

Long-chain fatty alcohols produced by 
the reduction of fatty acids can also serve 
as biofuels, and are currently produced 
either from petroleum, or by the hydrogenation
of FAMEs obtained from plant
oils. The microbial production of fatty
alcohols was achieved by the overexpression
of fadD, acr1 (an acyl CoA reductase from
Acinetobacter calcoaceticus BD413) and the
endogenous tesA in E. coli ΔfadE, and a
titer of 60 mg l⁻¹ was observed [132]. A
higher titer was obtained
by a functional reversal of the fatty acid
beta-oxidation cycle. Briefly, the transcriptional
regulator fadR was deleted, enabling
a constant expression of the beta-oxidation
genes and thus aerobic growth of the mutant
E. coli strain on medium-chain length
fatty acids. In addition, mutations were
introduced in crp (deregulating catabolite
repression) and in the regulatory gene atoC,
while the arcA gene was knocked out.
These mutations allowed the bacterium
to survive on short-chain fatty acids and,
when coupled with a deletion of the
fermentative pathway and overexpression of
the fadB, fadA, and fadM genes, resulted in
a fatty alcohol titer of 330 mg l⁻¹ [109]. In
another approach, a carboxylic acid
reductase (CAR) from Mycobacterium marinum
that converts fatty acids directly to their
corresponding aldehydes was expressed
in E. coli. The endogenous aldehyde
reductase and thioesterase were also
overexpressed, and 350 mg l⁻¹ of C8-C22 fatty
alcohols was produced [143].
Hydrocarbons such as alkanes and 
alkenes are also currently produced in E. 
coli. The first report of alkane production 
in E. coli involved the expression of an 
acyl ACP reductase (aar) and an aldehyde 
decarbonylase (adc) from Synechococcus
elongatus, and a mixture of both alkanes
and alkenes in the range of C13-C17 was
obtained [144]. In a related study, these two
genes were expressed along with the fabH2 
from B. subtilis, as this has a more relaxed 
substrate specificity as compared to the 
same gene from E. coli. This enabled the 
recombinant strain to produce odd-chain, 
even-chain, and also branched-chain alkanes
[145]. Alkanes were also produced directly
from fatty acids instead of acyl ACPs.
This approach is more advantageous because
fatty acids are more abundant in
cells as compared to acyl ACPs, and also 
because the nature and quantity of the 
fatty acid pool can be conveniently al- 
tered by various genetic manipulations. 
The fatty acid reductase complex from the 
bioluminescent bacterium Photorhabdus 
bioluminescens and aldehyde decarbonylase 
from Nostoc punctiformis were recombi- 
nantly expressed in E. coli, along with 
fabH2 and the thioesterase (fatB) from
Cinnamomum camphorum (specific for C14
acyl ACP). Although the alkane titer was
low, at 10 mg l⁻¹, this strategy provided a
method for customizing the final product 
as per requirement [146]. In another re- 
cent study, alkanes were produced in an 
E. coli ΔfadE ΔfadR strain by expressing
thioesterase, fatty acyl CoA reductase from
C. acetobutylicum, and a fatty aldehyde
decarbonylase from Arabidopsis thaliana.
Both gene deletions allowed an enhancement
of fatty acid levels, and the final
engineered strain produced 580.8 mg l⁻¹
of alkanes, ranging from C9 to C14 [147].
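
Since the reported alkane titer refers to a mixture of chain lengths, its molar equivalent depends on the composition. As a rough illustration, the sketch below brackets the molar concentration corresponding to 580.8 mg l⁻¹ by treating the product as either pure C9 or pure C14 linear alkane (general formula CnH2n+2). The actual product distribution is not given in the text, so these are bounds under that assumption, and the function name is illustrative.

```python
def alkane_molar_mass(n_carbons):
    """Molar mass (g/mol) of a linear alkane CnH(2n+2), using standard atomic masses."""
    return 12.011 * n_carbons + 1.008 * (2 * n_carbons + 2)


titer_mg_per_l = 580.8  # total alkane titer quoted in the text [147]

for n in (9, 14):
    mmol_per_l = titer_mg_per_l / alkane_molar_mass(n)
    print(f"C{n}: {alkane_molar_mass(n):.2f} g/mol -> {mmol_per_l:.2f} mmol/l")
```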


6.4 
Isoprenoid-Based Biofuels


The isoprenoids are another class of
compounds that are emerging as potential
biofuels or biofuel precursors. For
example, the sesquiterpene farnesene 
can be used as either a diesel or jet fuel 
after hydrogenation, while bisabolene 
can similarly be hydrogenated to produce 
the diesel substitute, bisabolane [148]. 
Bisabolene synthase from Abies grandis, 
when expressed in a S. cerevisiae strain 
already engineered to overproduce the 
precursor farnesyl pyrophosphate, led 
to a bisabolene titer of 994 mg l⁻¹ [148].
Similarly, 380 mg l⁻¹ farnesene was
produced in E. coli by expressing a
fusion of the codon-optimized farnesene
synthase (from apple) and farnesyl 
pyrophosphate synthase [149]. Isoprene, 
which can serve as the precursor for 
jet fuel, was produced in E. coli by the 
heterologous expression of either the 
mevalonate or MEP pathway. Basically, 
recombinant expression of the first two 
genes of the MEP pathway from B. subtilis 
and isoprene synthase from Populus nigra 
enabled the production of 314 mg l⁻¹
isoprene [150]. The theoretical yield of 
isoprene from glucose via the MEP 
pathway is higher than via the mevalonate 
pathway, although by modifying the latter 
a higher titer was achieved. Briefly, the 
HMG (3-hydroxy-3-methylglutaryl)-CoA
synthase, acetyl CoA acetyltransferase,
and HMG-CoA reductase from Enterococcus
faecalis were expressed in E. coli
so as to increase the mevalonate pool 
in the cell. Additionally, the endogenous 
mevalonate synthase was mutated and an 
isoprene synthase from Populus alba was 
recombinantly expressed. The combined 
effect of these modifications allowed 
an isoprene titer of 6.3 g l⁻¹ in 40 h
through a fed-batch fermentation strategy 
[151]. 


7 
Role of Synthetic Biology in Algal Fuels 


Microalgae, including cyanobacteria, are
considered a potential feedstock for
renewable biofuels, as they are capable of converting
atmospheric CO₂ to substantial biomass
and valuable biofuels, along with low-
and/or high-value products of commercial
importance. The production of a broad
range of commercial products including
biofuels, nutraceuticals, therapeutics,
industrial chemicals and animal feeds has
shifted the research paradigm significantly


towards the production of bioenergy
constituents under low-cost platforms, and
further improvements through genetic
engineering will enable and enhance
algae-based bioproducts [152]. However,
transgenic microalgae are yet to be
exploited commercially. The major
concern in microalgal genetic engineering
is a lack of molecular tools and
an overall poor expression of heterologous
genes from the nuclear genome of
many microalgae species, at least partially
due to their rapid silencing [153-155].
Tools for synthetic biology and genetic
engineering in algae are yet to be
developed, or are still in their infancy. One
relevant concept is that of "BioBricks," developed
by the BioBricks Foundation (http://biobricks.org/),
which relates to standardized DNA parts that
have a common interface and can be
assembled in living organisms. The parts
are the basic interchangeable elements
for regulating genetics, and current work focuses on the
development of the most common BioBricks
for cyanobacteria and algae, such
as promoters, transcriptional terminators,
ribosome-binding sites, and other
regulatory factors [156].
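
The idea of standardized parts with a common interface can be sketched as a small data structure. The example below is illustrative only: the part names and DNA sequences are placeholders rather than real registry parts, the fixed promoter, RBS, coding sequence, terminator order is a simplification, and the 8-bp junction ("scar") between parts is assumed from the commonly described BioBrick standard assembly rather than taken from the text.

```python
from dataclasses import dataclass

# Junction left between adjacent parts after assembly (assumed, see lead-in).
SCAR = "TACTAGAG"
ALLOWED_ORDER = ["promoter", "rbs", "cds", "terminator"]


@dataclass(frozen=True)
class Part:
    name: str
    kind: str      # one of ALLOWED_ORDER
    sequence: str  # placeholder DNA, not a real part sequence


def assemble(parts):
    """Join parts into one expression cassette, checking the order of part kinds."""
    kinds = [part.kind for part in parts]
    if kinds != ALLOWED_ORDER:
        raise ValueError(f"expected order {ALLOWED_ORDER}, got {kinds}")
    return SCAR.join(part.sequence for part in parts)


# Hypothetical parts; any part of the right kind can be swapped in because
# every part exposes the same interface (a kind and a sequence).
cassette = assemble([
    Part("P_demo", "promoter", "TTGACA"),
    Part("RBS_demo", "rbs", "AGGAGG"),
    Part("CDS_demo", "cds", "ATGAAATAA"),
    Part("T_demo", "terminator", "GCGCCGC"),
])
print(cassette)
```

The design point is interchangeability: because the interface is fixed, a promoter characterized in one cyanobacterial or algal host context can be exchanged for another without changing how the rest of the cassette is assembled.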

Another emerging approach is the integration
of "Omics" research in understanding
the behavior of biological systems
as a whole, where metabolic pathways
are often highly regulated and connected
through a number of both feed-forward
and feedback mechanisms that can
act positively and/or negatively, ultimately
affecting the system's output [157].
An understanding of the entire system
through integrated "Omics" research
will lead to the identification of relevant
enzyme-encoding genes, and allow the
reconstruction of metabolic pathways involved
in the biosynthesis and degradation
of precursor molecules that, in turn, may
have the potential for biofuel production,
helping to meet tomorrow's
bioenergy needs.

Currently, the following strategies are 
employed with regards to the fundamental 
aims of engineering processes for microal- 
gae, together with global R&D efforts and 
attempts toward commercialization: 


1. The selection of promising strains with
high lipid productivity and lipid quality. The
criteria also include adaptation to the 
local environment and their ability to 
use nutrients, to provide environmental 
and economic benefits, and to produce 
byproducts with commercial relevance 
[158]. 

2. Growth factors such
as light, CO₂, temperature, nutrient
availability and salinity are crucial
for higher biomass, and also play a
predominant role in influencing lipid
productivity [159].

3. Most cultivation conditions are broadly 
categorized into four major types, 
based on their different energy and 
carbon sources, such as autotrophic, 
heterotrophic, mixotrophic, and pho- 
toautotrophic [160]. Depending on the 
preferences of final product and condi- 
tions, two cultivation systems are usu- 
ally widely employed: (i) open scale-up 
microalgae production (such as open 
ponds or raceway ponds), which are op- 
erated under phototrophic conditions 
[161, 162]; and (ii) closed photobioreactors
which are used for heterotrophic
algae with wastewater and organic 
byproducts [158, 163]. 

4. Approaches for the enhancement
of lipid productivity still need to
be verified by different methods
of genetic/metabolic engineering,
where modifications involve the targeted
improvement of cellular activities by
manipulation of the enzymatic, transport,
and regulatory functions of the
photosynthetic cells using biological
modulators [164] and engineering
methods [165, 166].

5. Finally, harvesting of the algal biomass,
dewatering, the extraction of lipids, and
the conversion of lipids to biodiesel,
are all highly significant procedures in
biofuel production that can reduce production
costs for biodiesel by 40-60%
[158, 160, 167].

Fig. 6 Major improvements exploited through synthetic
biology for economic microalgal biofuel production.


Microalgal growth, and the interaction of
algal biofuel production with the environment,
can be better understood through the
development of models that are useful for
designing efficient bioreactors, predicting
process performance, and optimizing operating
conditions [168, 169]. Models of life cycle
assessment (LCA) can provide information about
technology, economics, and sustainability [159],
along with an evaluation of the inputs, outputs,
and potential environmental impacts of a product
[170]. Notably, the final outcome may vary with
different culture systems, and with the different
methods used for biomass harvest and oil refining
[171-174].

Recently, a scenario was proposed for 
the development of synthetic chromosome
technology for algae that will accelerate the
development of algal strains with a coordi- 
nated, predictable, and stable expression of 
desirable metabolic modifications (Fig. 6) 
[175]. Three major improvements are re- 
quired for economical microalgal fuel pro- 
duction: 


• Improving photosynthetic efficiency
through genetic/metabolic engineering
in microalgal mass-culturing
procedures.

• Channeling of fixed carbon into
higher-value fuel products of
commercial relevance.

• The development of robust, “engineered”
algal cells that will persist
and compete in lower-cost open or
semi-open environments.


Hopefully, the above discussions will 
provide some insights into major concerns 
associated with microalgal biofuel produc- 
tion, such as high capital investments, op- 
erational costs, and contamination-based 
problems. 
Antibiotic:
A chemical compound with highly efficient and specific antimicrobial activity, often 
used to treat microbial infectious diseases. 


Awakening: 
The genetic engineering of a silent or cryptic gene cluster, for example, by refactoring, 
to achieve substantial levels of production of the cluster-specific natural product. 
Combinatorial biosynthesis:

A general approach for creating libraries of new chemical variants of natural products 
by systematic high-throughput domain shuffling and related genetic engineering 
approaches. 


Cryptic gene cluster: 

A gene cluster encoding the biosynthesis of a natural product that is not detectable in 
standard assays and growth conditions, for example, because the amounts produced are 
below the detection threshold. 


Dereplication: 
The process of determining if a newly produced pharmaceutically active natural product 
is truly novel or has been studied before.


Domain shuffling: 

The targeted or random exchange of catalytic domains involved in the iterative 
biosynthesis of a natural product, in particular one produced by a multimodular 
enzyme, with the aim of generating new chemical variants of the original compound. 


Gene cluster: 

A group of genes that are close neighbors in the genome of an organism and act as a 
functional (and often evolutionary) unit. In bacteria, naturally produced antibiotics are 
usually the product of a well-delimited gene cluster. 


Mutasynthesis: 

A strategy for creating unnatural chemical variants of a natural product by replacing 
one (or more) of the natural biosynthetic precursors by a chemically synthesized analog 
acceptable to the subsequent biosynthetic enzymes. The pathway for the production of 
the natural precursor is usually deleted by genetic engineering. 


Natural product: 
A chemical compound produced by a biological system and often showing pharmaceuti- 
cal activity. The majority of antibiotics in clinical use are derivatives of natural products. 


Refactoring: 

The complete re-synthesis of the genetic sequence encoding a particular biological 
function, maintaining the encoded protein information, but removing all endogenous 
regulatory interactions and replacing them by engineered regulatory circuitry. 


Retrobiosynthesis: 

The design and engineering of complete biosynthetic pathways for chemical compounds 
not produced in Nature, based on knowledge of their chemical structures and combining
enzymes from a variety of organisms and/or enzymatic activities engineered de novo. 


Silent gene cluster: 
A gene cluster encoding the biosynthesis of a natural product that is not produced in
standard growth conditions, for example, because of strong repressive regulation. 


Antibiotics, and many related bioactive secondary metabolites, are synthesized 
by highly modular molecular machines, encoded by large multigene clusters in 
microbial genomes. Their natural modularity makes these biosynthetic machines 
attractive targets for an engineering-inspired synthetic biology approach. Synthetic 
biology can be used to rapidly identify novel antibiotics by a systematic 
awakening of silent gene clusters detected by comprehensive genome mining. 
It can also complement the established strategies of combinatorial biosynthesis 
and mutasynthesis, to enable the rapid generation of new chemical diversity. 
Finally, synthetic biology can expand the toolbox of metabolic engineering to 
construct powerful host strains (chassis) for the industrial production of bioactive
metabolites.


1 
Introduction 


Many applications of synthetic biology aim 
at the engineering of metabolism, with 
the purpose of either producing new com- 
pounds or improving the production of 
known ones, making it more economical, 
sustainable, greener, and cleaner. In this 
context, antibiotics could be considered as 
just one more class of metabolites to be 
subjected to the same general toolbox of 
synthetic biology. 

Why is the synthetic biology of an- 
tibiotics special? The specific bioactivity, 
the property of efficiently and specifically 
killing or inhibiting the growth of mi- 
crobes, cannot be relevant in this context. 
Instead, the discriminating features have 
to be the result of the special biological role 
and evolution of antibiotics. A number of 
these features can be defined, and they 
justify a special consideration of synthetic 
biology approaches to antibiotics research 
[1]. In their natural role, antibiotics are
compounds acting at the interface between
organisms. They can be antagonistic
(antibiotics in the strict sense aimed at 
killing or inhibiting surrounding organ- 
isms) or have more neutral functions (e.g., 
as communication molecules within and 
between species). In both cases, they are 
subjected to an “arms race” situation in
evolution: organisms will gain an evolu- 
tionary advantage if they are able to rapidly 
acquire and modify their repertoire of an- 
tibiotics, to avoid the emerging resistance 
of their antagonists, or to escape molecular 
eavesdropping by competitors. 

As a result of this evolutionary pressure 
for diversification and rapid change, an- 
tibiotics and their relatives not only show 
an amazing chemical complexity, but the 
biosynthetic pathways responsible for an- 
tibiotic production are also arranged with 
an unusual degree of modularity that
allows for rapid diversification and hori- 
zontal transfer [2] and can be considered 
a preadaptation for the engineering strate- 
gies of synthetic biology. 
Modularity of the antibiotics biosynthesis
pathways is seen at multiple levels:
it can be manifested within the biosyn- 
thetic enzymes, which are often huge 
multidomain assembly line structures [3]; 
within the synthetic pathway, which con- 
sists of biosynthetic core units and an 
exchangeable set of modifying and deco- 
rating enzymes generating molecular di- 
versity in the end product; and finally mod- 
ularity affects the entire pathway, which in 
all cases considered here is encoded by 
a compact genetic unit (a gene cluster) 
which can be transferred between organ- 
isms as an independent functional module 
and is expected to function in its new con- 
text without major rearrangements of the 
metabolism of the receiving host species. 
Again, this is an idea that closely matches 
the concepts of synthetic biology, such as 
orthogonality and insulation [4]. 

Based on these observations, the topic 
of this chapter can be redefined as the 
synthetic biology of complex metabolites 
produced by compact modular gene clus- 
ters, emphasizing the genetic peculiarities 
of the systems, rather than the somewhat 
arbitrary bioactivity implied by a focus on 
antibiotics. 

Naturally, despite the unique aspects of 
antibiotics chemistry and genetics, many 
challenges and developments in synthetic 
biology applied to antibiotics are shared 
with other areas of synthetic biology, and 
these will not be discussed here. In- 
stead, attention will be focused on three 
major topics that are specific to the antibiotics
field: synthetic biology methods
for genome prospecting for new bioactive 
molecules; synthetic biology approaches 
for the rapid generation of chemical diver- 
sity; and the engineering of versatile chas- 
sis for the biotechnological production of 
antibiotics based on synthetic biology ap- 
proaches. 


2 
The Need for Chemical Diversity 


From the perspective of human applications,
interest in synthetic biology for
antibiotics is driven by an incentive similar
to that driving the evolution of these compounds,
namely an imperative need for chemical
diversification. Any antibiotic use will
rapidly lead to the emergence of resistance 
amongst the targeted microbes. This can 
occur at a surprising speed, again facilitated
by the deep evolutionary history of
this class of compounds. Antibiotics have 
always been part of the evolutionary en- 
vironment of all microbes, and there is 
a natural reservoir of resistance mecha- 
nisms that, in the same fashion as an- 
tibiotics biosynthetic genes, can be rapidly 
transferred between species. This is com- 
plemented by the spread of resistance due 
to newly emerging mechanisms. No mat- 
ter what the origin of resistance might be, 
the emergence of resistance is guaranteed 
in any case, and it occurs rapidly. 

A comprehensive review of the his- 
tory and mechanisms of antibiotics re- 
sistance can be found on Wikipedia. 
The most serious threats of antibi- 
otics resistance are currently manifesting 
themselves in hospital settings, where re- 
sistance can rapidly evolve due to hor- 
izontal spread between pathogens and 
the particularly strong selective pres- 
sure exerted by widespread antibiotics 
use. However, community-acquired an- 
tibiotic resistant infections are increas- 
ingly common. Particularly serious are 
multidrug-resistant pathogens that de- 
velop resistance to some of the latest 
available effective antibiotics. Notorious 
examples are methicillin-resistant Staphy- 
lococcus aureus (MRSA), which emerged 
within years of the introduction of methicillin
in the clinic. Within a few
decades, this was succeeded by strains
of vancomycin-resistant Staphylococcus au- 
reus (VRSA), which additionally showed 
resistance to the one remaining drug that 
at the time was still effective against 
S. aureus infections. Recently, the emer- 
gence of carbapenem-resistant Enterobac- 
teriaceae (CRE) has led to grave concerns 
that one of the very few remaining drugs 
of last resort is losing its effectiveness 
against a major group of pathogens, with 
potentially dire consequences for the pub- 
lic health system [5]. 

Only the continuous development of 
new antibiotics can prevent a breakdown 
of the current ability to control microbial 
infections. So far, this strategy has been ex- 
tremely successful, with new drugs always 
becoming available just in time to take over 
when the previous generation was lost due 
to pervasive resistance. However, there are 
today various reasons to be concerned that 
this trend may not last much longer. The 
development pipelines for new antibiotics 
are in danger of running dry, with a major 
“innovation gap” between 1962 and 2000, 
when no major new classes of antibiotics 
were entering the clinic [6]. The reason for 
this problem is largely economic: antibiotics
are used very briefly, both at the level
of the individual patient (who tends to be 
cured after at most a couple of weeks) and 
at the level of the population (as emerging 
resistance quickly makes the drug lose its 
efficiency). This has two economic con- 
sequences: (i) it encourages the misuse 
of antibiotics (overprescription), with the 
aim of realizing its full economic potential 
while the drug is still working (and before 
patents run out); and (ii) it discourages 
costly efforts to develop new antibiotics, 
which will never be able to recover the huge 
development costs to the same extent that 
can be achieved with blockbuster drugs 
for the treatment of widespread chronic 
disorders, which often need to be taken
daily for the remaining lifetime of the pa- 
tient. Several other challenges represent 
further disincentives for major pharma- 
ceutical companies that are considering 
investments in antibiotics research [7]: 


1. There are currently no good chemoin- 
formatics approaches to predict the 
penetration of chemicals into micro- 
bial cells, equivalent to Lipinski’s “rule 
of five” filter for the identification of 
compounds which are likely to have the 
biophysical properties required for oral 
bioavailability. This lack of predictive 
filters leaves the chemical search space 
necessary to be screened for new an- 
tibiotic activities unmanageably large. 

2. Due to the lack of permeation into 
their microbial targets, antibiotics need 
to be administered at relatively high 
doses, despite their high innate speci- 
ficity. This increases the risk of losing 
promising antibiotics drug leads at a 
late stage of the development pipeline 
due to unacceptable toxicity. 

3. It is challenging to set up clinical 
trials for new antibiotics, as the rel- 
evant multidrug-resistant target infec- 
tions are still rare and obvious ethical 
concerns prevent placebo treatments 
for infected patients. In view of the in- 
creasing threat of resistant infections, 
the regulatory processes may become 
adjusted to reflect this special situation 
[8].
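
The kind of predictive filter mentioned in point 1 can be illustrated with Lipinski's rule of five itself. The following is a minimal sketch, assuming the commonly quoted thresholds (molecular weight at most 500 Da, logP at most 5, at most five hydrogen-bond donors, at most ten hydrogen-bond acceptors, with one violation tolerated). The class, the function names, and the descriptor values of the two example compounds are hypothetical and chosen only for illustration; they are not measured data. No equivalent filter for penetration into microbial cells is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    """Pre-computed descriptors for one compound (illustrative values only)."""
    name: str
    mol_weight: float        # molecular weight in daltons
    log_p: float             # octanol-water partition coefficient
    h_bond_donors: int       # number of hydrogen-bond donors
    h_bond_acceptors: int    # number of hydrogen-bond acceptors


def lipinski_violations(c: Compound) -> int:
    """Count how many of the four rule-of-five criteria are violated."""
    return sum([
        c.mol_weight > 500,
        c.log_p > 5,
        c.h_bond_donors > 5,
        c.h_bond_acceptors > 10,
    ])


def passes_rule_of_five(c: Compound, max_violations: int = 1) -> bool:
    """A compound passes if it violates at most `max_violations` criteria."""
    return lipinski_violations(c) <= max_violations


# Hypothetical descriptor sets: a small drug-like molecule and a large,
# highly polar molecule of the kind often found among natural products.
small_example = Compound("small_example", 350.0, 2.1, 2, 5)
large_example = Compound("large_example", 1450.0, -3.1, 19, 24)

for compound in (small_example, large_example):
    print(compound.name, lipinski_violations(compound), passes_rule_of_five(compound))
```

A filter of this form only screens for properties associated with oral bioavailability; the point made in the text is that no comparably simple rule is yet available for uptake by microbial cells.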


There are two major biosynthetic 
classes of antibiotics and related sec- 
ondary metabolites, which are particularly 
amenable to synthetic biology approaches, 
namely the products of polyketide 
synthases (PKSs) or of non-ribosomal 
peptide synthetases (NRPSs). The PKS 
can be further subdivided into three 
major types, according to the mechanism
of chain elongation of the growing end
product. Both type I PKS and NRPS
share a highly modular assembly-line 
arrangement of catalytic domains, 
encoded by huge open reading frames 
(ORFs) [2, 9]. They are responsible for the 
synthesis of some of the most complex 
chemicals with important clinical activity; 
for example, rapamycin is a type I PKS 
product with potent immunosuppressant 
activity [10], while vancomycin, the drug
of last resort for MRSA, is produced by
an NRPS [11]. In contrast, type II PKS
and the chalcone synthases (type III PKS) 
are assembled from small single-domain 
enzymes, which form an active complex 
and are iteratively used during elongation 
of the growing carbon chain, similar 
to the mechanism seen in fatty acid 
biosynthesis [12, 13]. In all classes, in 
addition to the core biosynthesis genes, 
there is a huge diversity of accessory 
enzymes that contribute to the further 
processing and modification of the core 
structure produced by the PKS or NRPS. 


3 
Classical Approaches to Antibiotic 
Biosynthesis 


A solid molecular understanding of the 
natural biosynthetic pathways provides 
the basis for the engineering of antibiotic 
biosynthesis by synthetic biology. This is 
still a field of active innovative research, 
but a recent example illustrates the 
general strategies employed for pathway 
elucidation. Kaysser et al. [14] used 
genome mining in Streptomyces to identify 
the gene cluster responsible for the 
biosynthesis of napsamycins, a class of
uridylpeptide antibiotics that was first
isolated in 1994 and is active against
pseudomonads via the inhibition of
phospho-N-acetyl-muramylpentapeptide
translocase I. While the biosynthesis of
related nucleoside antibiotics had already 
begun to be unraveled [15], little was 
known about the biosynthetic pathways 
producing the uridylpeptides [16], though 
one of the few available hints derived 
from feeding experiments indicated the 
involvement of an unusual nonribosomal 
peptide synthase mechanism. A first 
step towards cluster identification, by 
screening a cosmid library using degen- 
erate primers for N-methylation domains, 
adenylation domains, and genes involved 
in diaminobutyric acid biosynthesis, 
failed. However, the obtained sequence in- 
formation indicated that the napsamycin 
producer strain was closely related 
to Streptomyces roseosporus, for which 
whole-genome information was available. 
A genome scan identified a putative 
aminotransferase with a likely function in 
the biosynthetic pathway. The sequence of 
the surrounding region was then used to 
design primers targeting the two ends of 
the putative cluster, which were employed 
for another round of cosmid library 
screening. This resulted in one positive 
cosmid that was sequenced and contained 
a total of 29 ORFs that, presumably, 
were involved in napsamycin biosynthesis 
and associated functions (regulation, 
transport, and resistance). An analysis of 
the biosynthetic mechanism revealed that 
the core of the napsamycin structure is 
synthesized by a collection of small (single 
domain or didomain) nonribosomal 
peptide synthetase-like proteins that act 
in a nonlinear manner. Heterologous 
expression of the cluster confirmed the 
assignment to napsamycin biosynthesis, 
paving the way for the creation of 
new chemical varieties of napsamycin. 
Moreover, this and related studies (e.g., 
Ref. [17]) not only enhance the ability to
manipulate individual pathways but also
provide the building blocks required for 
the assembly of novel pathways using the 
tools of synthetic biology. 

Very similar approaches were used, for 
example, in the recent molecular cloning 
and heterologous expression of the gene 
cluster responsible for the production of 
the vinyl phosphonate tripeptide antibi- 
otic dehydrophos from Streptomyces luridus 
[18]. In this case, a fosmid library was 
screened for a phosphoenolpyruvate mu- 
tase gene, which was thought to be re- 
quired for phosphonate biosynthesis. Het- 
erologous expression of a positive fosmid 
in Streptomyces lividans, followed by deletion
analysis in combination with bioassays
and ³¹P NMR analysis, allowed the
delineation of a minimal biosynthetic clus- 
ter containing 17 ORFs; the function of 
each gene could then be determined by in- 
dividual deletion in the heterologous host. 

Similarly, Young and Walsh [19] identi- 
fied the pathway for biosynthesis of a thia- 
zolyl peptide, a member of the most heav- 
ily post-translationally modified group of 
ribosomal peptides, by fosmid library
screening for a cyclodehydratase gene involved
in formation of the thiazoline ring,
and transferred the gene cluster again into 
S. lividans for heterologous expression. 
Gene deletion and isotope feeding experi- 
ments in the heterologous host were then 
used for further elucidation of the details 
of the biosynthetic mechanism. 

As can be seen already from these 
examples, elucidating the biosynthetic 
pathway for a newly identified antibiotic 
is only the first step toward its further en- 
gineering. The second essential ingredient 
is the ability to express the pathway in a
heterologous host. Considerable progress has
been made in this direction during recent 
years, and even very complex bioactive 
molecules encoded by huge gene clusters 
can now routinely be produced in a wide
range of hosts. An example is the total
biosynthesis of an antitumor nonribosomal
peptide, echinomycin, from Streptomyces 
lasaliensis in Escherichia coli [20]. The ability 
to transfer complex biosynthetic machin- 
ery into a well-characterized and indus- 
trially established microbial host not only 
has immediate benefits concerning stable 
cultivation, but the ease of genetic manipulation
of E. coli also offers long-term advantages
with respect to the engineering of the
host metabolism for improved antibiotic 
production and the biochemical modifica- 
tion of the intended end products. In the 
case of echinomycin, the heterologous ex- 
pression required the introduction of 14 
biosynthetic genes (eight for production of
the quinoxaline-2-carboxylic acid moiety,
five for the peptide backbone biosynthesis
and modification, and one for an acyl
carrier protein) plus one resistance gene, 
distributed on three plasmids. This first 
example provided the proof-of-concept for 
heterologous expression of a large class of 
medicinally important nonribosomal pep- 
tides in E. coli. 

While Watanabe et al. transferred a 
biosynthetic pathway to a classical model 
organism, Li et al. [21] focused on the ex- 
pression of a polyketide, tetracenomycin, 
in a heterologous Streptomyces species, 
S. cinnamonensis, aiming at an improved 
production by exploiting an industrial pro- 
ducer strain that had earlier been opti- 
mized by classical strain improvement 
techniques to achieve high titers of a 
chemically related compound, monensin. 
In this case, substantially higher produc- 
tion levels were achieved than in the 
native host, Streptomyces glaucescens, but 
the challenge arose that feedback inhi- 
bition led to an accumulation of path- 
way intermediates rather than the native 
end product, indicating the potential for 
further optimization using synthetic biology
methodology.

Not only in E. coli, but also in genome- 
sequenced and genetically tractable model 
species amongst Streptomyces, engineer- 
ing of the host is a promising strategy. 
Most importantly, this includes removal
of the native biosynthetic
pathways for antibiotics and other secondary
metabolites. In the studies of
Li et al. [21], deletion of the monensin
biosynthetic capacity did not affect
the production of tetracenomycin. In the 
most intensely characterized Streptomyces 
species, S. coelicolor, a series of derivatives 
optimized for heterologous expression 
have been created, including a strain that 
lacks the endogenous biosynthetic path- 
ways for all characterized antibiotic com- 
pounds of the species (M1146, which lacks 
the gene clusters for actinorhodin, unde- 
cylprodigiosin, calcium-dependent antibi- 
otic, and the coelicolor polyketide (CPK) 
[22]), and a further derivative
with additional mutations that in
other hosts increase the potential for 
heterologous production of antibiotics 
(M1154 [22]). Flinspach et al. [23] ex- 
amined the potential of these geneti- 
cally modified strains for the heterol- 
ogous expression of the biosynthetic 
gene clusters for two aminocoumarin 
antibiotics (coumermycin A1 and clorobiocin)
and the liponucleoside antibiotic
caprazamycin. These authors found
distinct differences between the two 
classes of compounds, in that while 
the aminocoumarins were unaffected by 
the host background, caprazamycin pro- 
duction was optimal in strain M1154. 
The underlying molecular mechanism for 
these differences has not yet been eluci- 
dated; however, an improved understand- 
ing of the interaction between pathway 
and heterologous host will be critical for
the success of synthetic biology strategies.

In the spirit of synthetic biology, the 
straightforward transfer of a biosynthetic 
pathway to a heterologous host can 
sometimes provide new insights into 
the function of the pathway itself. This 
is shown in the work of Park et al. 
[24] on the biosynthesis of kanamycin, 
a widely used aminoglycoside antibi- 
otic with a surprisingly complex and 
understudied biosynthetic pathway. Re- 
construction of the biosynthetic pathway 
from Streptomyces kanamyceticus in Strep- 
tomyces venezuelae (which does not pro- 
duce aminoglycosides natively) revealed 
an early branch point in the biosyn- 
thetic pathway: the first glycosyltrans- 
ferase, KanF, can also act promiscuously as 
an N-acetylglucosaminyltransferase, lead- 
ing to two parallel streams of early interme- 
diates. This immediately opened up new 
opportunities for the engineering of chem- 
ical diversity by pathway manipulations; 
for example, exchanging the glycosyltrans- 
ferase could modify the relative flux along 
the two branches, while the addition of 
various combinations of the later enzymes 
of the pathway could be used to create 
an entire library of various antibiotic com- 
pounds, some of them new. 
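
The size of such a library can be estimated by simple enumeration. The sketch below is purely illustrative and assumes that each early branch can be combined with any subset of the later tailoring enzymes, ignoring enzyme order and substrate compatibility; the branch and enzyme names are hypothetical placeholders, not the actual kanamycin pathway components.

```python
from itertools import combinations


def enumerate_variants(cores, tailoring_enzymes):
    """List every (core, enzyme subset) pairing, including the empty subset."""
    variants = []
    for core in cores:
        for k in range(len(tailoring_enzymes) + 1):
            for subset in combinations(tailoring_enzymes, k):
                variants.append((core, subset))
    return variants


# Hypothetical placeholders for two parallel early branches and three
# later pathway enzymes.
cores = ["branch_A", "branch_B"]
tailoring = ["enzyme_1", "enzyme_2", "enzyme_3"]

library = enumerate_variants(cores, tailoring)
print(len(library))  # 2 cores x 2**3 enzyme subsets = 16
```

In practice, only a fraction of these theoretical combinations would be expected to yield a detectable product, since later enzymes may not accept every intermediate.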

Wang et al. [25] achieved similar ex- 
pansions into the chemical space around 
known bioactive scaffolds by heterologous 
expression of various tetracycline biosyn- 
thetic pathways in Streptomyces species. 
The engineered host-pathway combinations
yielded numerous new derivative
compounds, and allowed the identification
of a new class of tailoring enzymes that can
be used to modify the core tetracycline scaffold
in a targeted way in different positions.

In the era of synthetic biology, heterologous
expression does not have to
be restricted to the transfer of entire
biosynthetic pathways in their original
form. Shao et al. [26] have recently, for
the first time, applied a refactoring strategy
to a natural product biosynthetic pathway.
This strategy has been derived from concepts
in software development and was originally
developed by Voigt and colleagues for large
cellular machineries, such as the nitrogen
fixation cluster of Klebsiella oxytoca [27];
it aims at the removal of all native regulatory
mechanisms that might interfere with
the functioning in a heterologous host.
Shao et al. decoupled the pathway (which 
in their case was the spectabilin path- 
way from Streptomyces orinoci, which was 
silent in its original host) from the native 
regulatory circuitry by replacing original 
promoters with suitable strong heterolo- 
gous promoters (constitutive or inducible), 
and this resulted in the desired expres- 
sion level in the targeted host and culture 
conditions. For additional versatility, the 
system was modularized so that the core 
biosynthetic pathway could be combined 
in a “plug-and-play” scaffold with relevant
“helper modules” specific for the chosen
host species, which provide elements such 
as origins of replication, selection markers, 
and integrases. The resulting huge artifi- 
cial gene clusters were assembled using 
the DNA assembler approach developed 
previously by the same group [28]. 
Clearly, the synthetic biology of an- 
tibiotics is not going to replace the tra- 
ditional approaches, but will evolve in 
close interaction with the molecular char- 
acterization of natural biosynthetic gene 
clusters and their regulation [29]. Be- 
tween 2000 and 2008, more than 300 
new natural compounds with potential 
antibiotic activity were reported, includ- 
ing 145 metabolites from 13 structural 
classes that show promisingly high efficiency
(minimum inhibitory concentrations
between 0.02 and 10 µg ml⁻¹ [30]),
the majority being phenolic compounds, 
quinones, and alkaloids. These continuing
studies of mining for natural products (albeit
mostly by small companies rather than
major pharmaceutical producers [29]) are
complemented by the ongoing discovery of
entirely novel antimicrobial mechanisms. 
Russell et al. [31] recently reported the 
discovery of a superfamily of bacterial 
phospholipases that act as antibacterial 
effectors in inter- and intra-species interac- 
tions by degrading the major component 
of bacterial cell membranes, namely phos- 
phatidylethanolamine. Of special interest 
from a synthetic biology perspective are de- 
velopments at the interface of chemistry 
and biology, such as the establishment 
of a peptide-morpholino oligonucleotide 
conjugate which, in combination with a 
bacteriostatic cell-penetrating peptide, ex- 
erts antibiotic activity over a wide range of 
pathogen species [32]. 


4 
Synthetic Biology Methods for Genome 
Prospecting of New Bioactive Molecules 


The recent explosion in the number of 
fully sequenced bacterial genomes has led 
to an entirely new appreciation of the 
chemical diversity potentially produced by 
microbes. For example, when the model 
species of the actinomycetes, Streptomyces 
coelicolor, was first sequenced, it emerged 
that this well-studied species contained 
biosynthetic gene clusters for about four 
times as many secondary metabolites as
had previously been identified using tradi- 
tional approaches [33]. Further sequences 
from other widely used actinomycetes con- 
firmed this pattern, which is now accepted 
to be universal. For example, Medema et al. 
[34] found that the genome of Streptomyces 
clavuligerus not only contains 23 putative 
secondary metabolite biosynthesis gene 
clusters, but is complemented by a huge 
1.8 Mb megaplasmid encoding a further
25 gene clusters for a wide range of com- 
pounds, including staurosporine, moeno- 
mycin, beta-lactams, and enediynes. This 
discovery implies that the larger part of 
the secondary metabolite universe is cryp- 
tic, either because the biosynthetic gene 
clusters are “sleeping” in the assayed
conditions, or because their products are 
present in insufficient amounts for suc- 
cessful characterization. 

Moving beyond actinomycetes, Letzel 
et al. [9] analyzed the cryptic metabolome 
of 211 complete genome sequences from 
anaerobic bacteria, a group that was widely 
believed to be incapable of producing sec- 
ondary metabolites of any relevance. In 
this case, the analysis was focused on 
PKS and NRPS, two cluster types that
can be reliably detected and are known 
to be responsible for the production of a 
broad range of bioactive metabolites. In 
a review of genomes from more than a 
dozen phyla, Letzel et al. found that even in 
this supposedly secondary metabolite-poor 
group, 69 out of 211 strains contained predicted
NRPS or PKS genes. The largest
numbers were found in soil isolates,
while extremophiles (e.g., Thermotogae)
or pathogens (e.g., Spirochaetes) were
clearly devoid of biosynthetic capacities.
The biotechnologically important cellulolytic
Clostridium cellulovorans, with seven
NRPS and NRPS-PKS hybrid clusters
covering almost 2.5% of the genome, is
the current “anaerobic champion” of sec- 
ondary metabolite diversity. 

Feng et al. [35] systematically expressed 
polyketide synthase systems derived di- 
rectly from environmental soil DNA sam- 
ples in Streptomyces albus, circumventing 
the need for culturing the native host 
species and thus expanding the pool of 
accessible sources of bioactive molecules 
to the millions of species that have so far
been impossible to grow in the laboratory.
However, special attention needs to be 
paid to the methods of DNA isolation 
and cloning, as the relevant gene clusters 
tend to be much larger than the sequences 
that are collected in typical metagenomics 
studies. Feng et al. [35] tackled this is- 
sue by screening environmental cosmid 
libraries for the presence of a widely 
distributed conserved marker gene, KSz, 
and subsequent transfer of the positive 
clusters into the expression host by con- 
jugation. Clusters were prioritized based 
on a phylogenetic analysis of their KS, 
sequences, and the production of novel 
metabolites was detected in small-scale 
(50 ml) cultures first, and then scaled
up to larger (>3 liter) fermentations for 
structural characterization. When a large 
number of low-abundance, clone-specific 
metabolites were detected, this was con- 
sidered an indication that only a partial 
gene cluster had been contained in the 
original cosmid, and overlapping cosmids 
were identified to assemble the complete 
biosynthetic machinery. The successful 
application of this strategy, for example, 
led to the identification of a novel polyke- 
tide that contained a pentacyclic ring sys- 
tem never previously reported in either a 
natural or a synthetic compound. 

Of course, the genome-based discovery 
of chemical novelty does not need to be 
restricted to the type of sources that have 
yielded the classical secondary metabolite 
producers (mostly various types of soil 
samples). For example, interesting new 
compounds have been obtained from the 
microbial symbionts of social insects or 
various marine species of animals [36], and 
in many of these cases not only the gene 
cluster but also the organism containing 
it is cryptic. Plants are another particularly 
rich source of secondary metabolites, and 
the transfer of biosynthetic pathways to
microbial expression hosts can potentially
overcome many of the limitations asso- 
ciated with the production of valuable 
bioactive compounds in plants. This is well 
illustrated by the groundbreaking stud- 
ies of the artemisinin pathway performed 
by Jay Keasling and colleagues, imple- 
menting the industry-scale production of 
an antimalarial plant-derived drug in Es- 
cherichia coli and Saccharomyces cerevisiae 
[37, 38]. 

The challenge now is twofold: (i) to 
reliably discover secondary metabolite 
gene clusters; and (ii) to awaken this 
cryptic potential for the production of new 
antibiotics. 

To address the first challenge, various 
bioinformatics solutions have been 
developed [39, 40]. This is facilitated by 
the modular nature of the assembly-line 
machinery responsible for the production 
of many classes of natural product, 
which allows not only the rapid detection 
of, for example, polyketide synthase 
and NRPS, but also their annotation 
with putative primary structures of 
the expected cluster products based on 
the colinearity of enzymatic modules 
and chemical moieties added to the 
growing compound (e.g., Ref. [41]). The 
antiSMASH software brings together 
many of these individual elements into a 
unified package for the comprehensive 
annotation of secondary metabolite gene 
clusters [42, 43]. Using complete or partial 
genome sequences as input, this software 
detects biosynthetic gene clusters for 
the production of a large collection of 
secondary metabolites, including polyke- 
tides, nonribosomal peptides, terpenes, 
aminoglycosides, aminocoumarins, in- 
dolocarbazoles, lantibiotics, bacteriocins, 
nucleosides, beta-lactams, butyrolactones, 
siderophores, melanins, oligosaccharide 
antibiotics, phenazines, thiopeptides, 
homoserine lactones, phosphonates, and
furans, as well as aligning identified gene 
clusters to their closest homologs using 
an integrated version of MultiGeneBlast 
[44] and predicting putative primary 
structures for several groups of complex 
metabolites. The most recent release of 
the software and associated web site [42] 
allows the use of draft genomes and 
metagenomic sequences as input. 
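
The colinearity principle described above can be illustrated with a small sketch. It is not antiSMASH code and does not reproduce its output; the module names and substrate labels are hypothetical placeholders, and the approach assumes a strictly colinear assembly line in which each module adds exactly one building block.

```python
# Toy illustration of the colinearity rule: each module of an assembly line
# adds one building block, so the module order predicts the product backbone.
# Module names and substrate labels are hypothetical placeholders, not real
# specificity predictions from any annotation tool.
from typing import List, NamedTuple


class Module(NamedTuple):
    name: str       # label of the enzymatic module
    kind: str       # "PKS" or "NRPS"
    substrate: str  # building block predicted for this module


def predict_backbone(modules: List[Module]) -> str:
    """Join the predicted building blocks in module order."""
    return "-".join(m.substrate for m in modules)


example_cluster = [
    Module("load", "PKS", "propionyl"),
    Module("mod1", "PKS", "methylmalonyl"),
    Module("mod2", "PKS", "malonyl"),
    Module("mod3", "NRPS", "Ala"),
]

print(predict_backbone(example_cluster))
```

A prediction of this kind is only a putative primary structure; tailoring reactions and any deviation from colinearity are outside the scope of the sketch.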

The second challenge, to awaken silent
and cryptic gene clusters, requires a
long-term effort of synthetic biology. A 
variety of different tools will be de- 
ployed to this end. For example, Gottelt 
et al. [45] awakened one of the cryp- 
tic gene clusters found in the Strepto- 
myces coelicolor genome by deleting a 
pathway-specific regulator. This gene clus- 
ter had not been identified until the 
genome was sequenced, and was the first 
proof-of-principle example for the awak- 
ening of a cryptic gene cluster. Further 
examples of cluster awakening by disrup- 
tion of repressors or overexpression of 
activators have recently been discussed in 
detail by Aigle and Corre [46]. 

Another strategy of awakening relies 
simply on overexpression in a heterolo- 
gous host. For example, Franke et al. [47] 
detected a conserved cryptic polyketide 
cluster in the genomes of a group of 
pathogenic Burkholderia species. The 
cluster was silent under standard growth 
conditions and unusual in its architecture, 
with many of the enzymatic modules 
incomplete or disrupted, and thioesterase 
domains found in unusual places. The 
overexpression and deletion of a putative 
pathway-specific regulator did not awaken 
the cluster, but the introduction of a 
constitutive promoter resulted in a strong 
yellow pigmentation and the production 
of a new, but very unstable, metabolite. 
Chemical stabilization by methylation 
allowed the structural elucidation of this
product (now named burkholderic acid), 
revealing a very unusual biosynthetic 
pathway that involved the incorporation 
of a methionine-derived building block, 
the fusion of two independent polyketide 
chains, and the subsequent formation of 
a furan ring. 

A more generalized strategy, which po- 
tentially is suitable for high-throughput 
studies, was recently introduced by Tanaka 
et al. [48], who showed that various muta- 
tions of the rifampicin resistance gene 
rpoB lead to the activation of a wide range 
of silent or cryptic biosynthetic gene clus- 
ters in several species of actinomycetes, 
the products of which could be detected 
and characterized by metabolic profiling. 
A variety of related methods of increas- 
ing and diversifying secondary metabolite 
production have been discussed by Scherlach
and Hertweck [49] and Zhu et al. [50].
The most generally applicable approach, 
with clear implications for a synthetic biol- 
ogy strategy, seems to be the modulation 
of the epigenetic state of gene clusters. 
This is currently achieved by rather nonspecific
interference with DNA methylation
or histone acetylation, for example, using DNA
methyltransferase or histone deacetylase
inhibitors [51, 52], but a more direct modulation
of the epigenetic state of a specific
cluster will be possible for resynthesized 
clusters in the future. 

Various types of external stress, such as 
heat shock or enzyme inhibitors, have also 
been reported as leading to the awaken- 
ing of compound production in specific 
examples. As many bioactive molecules,
particularly antibiotics, are expected to
be involved in inter-species interactions 
(either predatory or mutualistic), the cocul- 
tivation of strains is another powerful 
strategy for the awakening of clusters. A 
recent example was the induction of a
cryptic terpenoid biosynthesis pathway in
Aspergillus fumigatus, a pathogenic fungus 
encoding a huge number of PKS and non- 
ribosomal peptide synthetases, but with a 
very limited number of characterized sec- 
ondary metabolites, by cocultivation with 
a soil-dwelling actinomycete bacterium, 
Streptomyces rapamycinicus [53]. Unfortu- 
nately, this approach is limited to the small 
set of cultivatable microbes, and in con- 
trast to the stress-induced or epigenetic 
awakening it is far more reliant on the 
physiological context of the native host. 
However, interestingly, this limitation 
may not be too problematic, as the main 
mechanism of cluster activation in the 
Aspergillus-Streptomyces interaction seems
to occur via an epigenetic mechanism, 
involving the alteration of fungal gene ex- 
pression by induced histone modification 
in response to the bacterial neighbor [54]. 


5 
Synthetic Biology Approaches for the Rapid 
Generation of Chemical Diversity 


While genome sequences and awakening 
strategies provide an enormous amount 
of novel biosynthetic building blocks for 
a synthetic biology strategy, the main 
advances are expected from the innovative 
combination of these building blocks to 
create chemical novelties. This strand 
of the synthetic biology of secondary 
metabolites relies on the modular nature 
of the synthetic pathways, and builds on 
a rich heritage of traditional approaches, 
most importantly combinatorial biosyn- 
thesis and mutasynthesis. Combinatorial 
biosynthesis, starting with the classic 
studies of McDaniel et al. [55], aims 
at the engineering of novel secondary 
metabolites by the rational recombination 
of the biosynthetic modules contained 
in natural assembly-line biosynthesis
machinery. Mutasynthesis, on the other
hand, combines engineered natural 
pathways with exogenous building blocks 
from synthetic chemistry to expand the 
accessible chemical diversity, beginning 
with the groundbreaking investigations of 
Shier et al. as early as 1969 [56]. 

Neither of these strategies is superseded
by synthetic biology; rather, both are further
empowered by the advances in DNA synthesis
and design that underpin the synthetic 
biology revolution, and synthetic biology 
approaches will play a key role in the 
progress towards a rational design of new 
compounds by combinatorial biosynthe- 
sis. At the same time, these investiga- 
tions provide critical information on the 
mechanistic constraints under which the 
synthetic biology of secondary metabo- 
lite biosynthesis is operating. Important 
examples here are the various instances 
of noncolinear PKS, which stand in the 
way of too-simplistic approaches towards 
recombining biosynthetic modules. A de- 
tailed analysis of the control of the chain 
elongation process in such a system has 
recently been provided by Busch et al. [57] 
for the aureothin pathway of Streptomyces 
thioluteus. 

Wong and Khosla [58] have provided 
a comprehensive review of the combi- 
natorial biosynthesis of polyketides, fo- 
cusing on the products of multimodular 
(assembly-line) PKS and highlighting the 
major challenges that currently prevent
the full promise of combinatorial biosynthesis
from being realized:


• The limitation of functionally fully characterized
gene clusters and enzymatic
activities.

• Enzymological difficulties in ensuring
effective substrate channeling along
a combinatorially reshuffled assembly
line and the necessary relaxed substrate
specificity of the enzymes that are
reshuffled.

• The lack of methods for rapidly assembling
the large gene clusters necessary
for combinatorial biosynthesis.


The latter challenge in particular is at 
the center of the remit of synthetic biol- 
ogy, and the review closes appropriately 
by outlining a vision for combinatorial 
biosynthesis that could be taken straight 
from a manual for synthetic biology and 
the BioBrick repository: 


“In the future, large-scale combi- 
natorial biosynthesis will require 
a catalogue of validated domains, 
didomains, modules, and linkers 
capable of performing the spec- 
trum of catalytic chemistry that 
is observed in nature’s assembly 
lines. Their salient characteristics 
will be well documented in order 
to allow the engineer to rationally 
choose the right components for 
rationally designing a chimeric 
assembly line. The catalogue will 
be accompanied by an instruction 
manual, and reference data from 
a set of control experiments.”


The first steps in this direction are already
being taken, an example being the recent
review of lantibiotic drug discovery by
Montalban-Lopez et al. [59], which emphasizes
the role of synthetic biology not
only in awakening cryptic clusters but also
in the rapid creation of chimeric compounds
with new and improved properties.
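
To make the idea of a parts catalogue concrete, the following minimal sketch enumerates candidate chimeric assembly lines from a small catalogue and discards designs that contain a pairing known to fail. The part names, the slot layout, and the incompatible pair are all hypothetical; a real catalogue would rest on the validated domains, modules, and linkers envisaged in the quotation above.

```python
# Enumerate chimeric assembly-line designs from a hypothetical parts catalogue.
# Part names and the incompatibility list are illustrative assumptions only.
from itertools import product

catalogue = {
    "loading": ["L1", "L2"],
    "extender_1": ["E1a", "E1b", "E1c"],
    "extender_2": ["E2a", "E2b"],
    "release": ["TE1"],
}

# Adjacent pairs assumed (for illustration) to have failed in control experiments.
incompatible = {("E1c", "E2a")}


def enumerate_designs(catalogue, incompatible):
    """Return every slot-wise combination that has no incompatible neighbours."""
    slots = list(catalogue)  # dict order defines the order of the assembly line
    designs = []
    for combo in product(*(catalogue[slot] for slot in slots)):
        neighbours = zip(combo, combo[1:])
        if any(pair in incompatible for pair in neighbours):
            continue
        designs.append(combo)
    return designs


designs = enumerate_designs(catalogue, incompatible)
print(len(designs), "candidate designs")
```

The filter step is where documented reference data would enter: the more pairings are characterized, the smaller and more reliable the set of designs that needs to be built.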

The modular nature of the biosyn- 
thetic pathways, which is exploited in 
combinatorial biosynthesis, also makes 
antibiotics production a promising target 
for multiplexing approaches to genome 
engineering, such as Multiplex Automated 
Genomic Engineering (MAGE), which can 
target multiple loci to produce a huge
number of strains with defined combi- 
natorial genetic diversity [60]. 
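
The scale of such combinatorial diversity can be estimated with a short calculation. The sketch below assumes that loci vary independently and that every genotype is equally likely to be sampled, which is an idealization; the numbers of loci and variants are hypothetical and are not taken from Ref. [60].

```python
# Estimate the size of a multiplexed combinatorial library and the sampling
# effort needed to cover it. All design numbers are hypothetical.
from math import ceil, log, prod


def library_size(variants_per_locus):
    """Number of distinct genotypes when loci vary independently."""
    return prod(variants_per_locus)


def clones_for_coverage(size, probability=0.95):
    """Clones to sample so that one given genotype is observed with the stated
    probability, assuming all genotypes are equally likely (size must be > 1)."""
    return ceil(log(1.0 - probability) / log(1.0 - 1.0 / size))


# Hypothetical design: six targeted loci with four allowed variants each.
loci = [4, 4, 4, 4, 4, 4]
size = library_size(loci)
print(size, "genotypes;", clones_for_coverage(size), "clones for 95% coverage")
```

Because library size grows multiplicatively with each additional locus, screening capacity rather than strain construction quickly becomes the limiting factor.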

With combinatorial biosynthesis thus 
moving quickly to become a core activity 
of the synthetic biology of secondary 
metabolism, mutasynthesis is not far 
behind [61]. In this case, attention 
is focused not on manipulating the 
synthetic pathway directly but rather 
on changing its cellular context in such 
a way that a critical building block is 
no longer available, forcing the cell 
to incorporate an exogenous synthetic 
analog with high efficiency. For example, 
Gregory et al. [62] used a Streptomyces 
hygroscopicus strain that was unable to 
synthesize the native starter unit,
4,5-dihydroxycyclohex-1-enecarboxylic acid,
for the anticancer drug rapamycin, and
created a library of novel “rapalogs” by
providing variant starter units that differed 
in the position and stereochemistry of 
hydroxylation, and also in the presence 
or absence of double bonds and variant 
cycle sizes. In this case, the mutant 
strain was found by serendipity, during 
attempts to create a strain that was 
deficient in the final processing of the end 
compound. However, with an increasing 
understanding of the molecular biology 
of the biosynthetic pathways, a far more 
targeted strain design is now possible 
(see, e.g., Refs [63, 64]) and the number 
of reports employing the mutasynthesis 
approach has skyrocketed. 

As the success of mutasynthesis de- 
pends on a sufficiently relaxed substrate 
specificity of all subsequent biosynthetic 
steps following incorporation of the ex- 
ogenous building block, there is an ob- 
vious role for synthetic biology in the 
optimization of the approach for a wider 
coverage of chemical space. Inspiration 
for these investigations will come, for
instance, from the classic example, namely
the creation of a hybrid PKS, combining 
the extender units of the erythromycin 
biosynthesis PKS with the much more 
substrate-tolerant loading module of the 
avermectin PKS, to allow the generation 
of a library of erythromycin analogs [65]. 
More interesting, however, will be the 
more ambitious strategy of combining 
mutasynthesis with the supply of variant 
precursors by synthetic biology, instead of 
synthetic chemistry, either in the same 
strain or in a synthetic microbial commu- 
nity. 

This will perhaps be most relevant 
for diversification of the tailoring steps 
of antibiotics biosynthesis, rather than 
for mutasynthesis of the core scaffold 
of the structure. For example, Thaker 
and Wright [66] discuss synthetic biology 
strategies for increasing the diversity of 
glycopeptides by manipulation of the 
tailoring reactions, while Giessen and 
Marahiel [67] provide an overview of di- 
versification of nonribosomal peptides by 
synthetic biology. The ability to engineer 
novel enzymatic functionalities further 
expands the range of accessible chemistry; 
for example, Felnagle et al. [68] discuss the 
construction of new recursive biosynthetic 
pathways (analogous to those for fatty 
acids, polyketides, and isoprenoids), using 
non-recursive enzymes as the starting 
point. 


6 
Engineering of Versatile Chassis for the 
Synthetic Biology of Antibiotics 


The synthetic biology of antibiotics prod- 
ucts depends not only on the abil- 
ity to assemble new pathways for in- 
creased chemical diversity, but also on the 
availability of suitable production hosts 
(chassis). 


The first challenge in creating a suitable 
chassis is the background of secondary 
metabolites typically produced by most 
microbes. This is particularly problem- 
atic when actinomycetes are considered 
as overexpression hosts. Although these 
may be preadapted to producing a wide va- 
riety of chemicals at high levels, they also 
show a particularly rich repertoire of native 
secondary metabolites that would interfere 
with the production, detection, and purifi- 
cation of heterologous compounds. Many 
attempts have therefore been made to pro- 
duce genome-minimized actinomycetes 
with a “clean” chemical background. 

Zhou et al. [69] created a series of 
derivatives of S. coelicolor in which 
they sequentially deleted all 10 pre- 
dicted PKS and NRPS gene clusters, 
plus an additional 900 kb subtelomeric
sequence (to create a more compact 
version of the genome). Clusters were 
deleted by the polymerase chain reaction 
(PCR)-targeting of cosmids and subse- 
quent removal of the selection marker 
(contained in an FRT-aac(3)IV-FRT cas- 
sette) at each step by the expression of 
FLP-recombinase. The resulting “super- 
host” is a large step forward compared 
to an earlier version, in which only the 
four most highly active secondary metabo- 
lite gene clusters (for actinorhodin, prodi- 
giosins, calcium-dependent antibiotic, and 
yCPK/coelimycin) were deleted [70]. In
the latter strain, which also contained 
point mutations in rpoB and rpsL for the 
pleiotropic induction of secondary metabo- 
lite production, heterologous expression of 
various antibiotic biosynthesis pathways 
led to substantially increased production 
levels compared to the parental strain. 

Similarly far-reaching genome mini- 
mization has been carried out for Strep- 
tomyces avermitilis, an industrial producer 
of secondary metabolites [71]. Systematic 
deletion of nonessential genes, in particular
from the subtelomeric regions, by
site-specific recombination resulted in a 
strain that contained only about 82% of 
the parental genome and did not produce 
any of the major secondary metabolites
seen in the wild-type. The strain also had 
lost more than 80 (78%) of the wild-type 
transposase genes and a large part of 
the associated insertion sequences, which 
should substantially increase the genome 
stability of the strain. The genome-reduced 
strain showed an efficient production of
a wide range of heterologous compounds 
which, in the case of streptomycin and 
cephamycin C, exceeded that of the native 
producers. 

In addition to a reduced background 
of endogenous secondary metabolites, an 
efficient chassis will also require an opti- 
mized primary metabolism for efficient 
precursor supply [72]. The importance 
of metabolic modeling for the optimiza- 
tion of secondary metabolite production 
hosts has been illustrated repeatedly, 
for example, in the generalized overpro- 
duction of pigmented antibiotics (acti- 
norhodin and prodigiosins) in S. coeli- 
color upon the deletion of a phospho- 
fructokinase isoform which, according to 
genome-scale modeling, was predicted to 
lead to an increased flux through the 
pentose phosphate pathway and toward 
the pigmented antibiotics [73]. A detailed 
metabolic model was also used to iden- 
tify limiting reactions in the biosynthesis 
of daptomycin by Streptomyces roseosporus, 
and overexpression of three selected en- 
zymes resulted in increased production 
levels, both individually and in combination [74].
Although the increases were relatively subtle
(a maximum improvement of 43.2%
above the parental strain), the accuracy
of the model-driven target selection was 
very encouraging. This was confirmed in 
a second study, aiming at the improvement
of FK506 production in Streptomyces
tsukubaensis [74]. Again, the manipulation
of target genes identified by
genome-scale modeling by overexpression
or gene deletion reliably increased the 
production levels. Interestingly, the im- 
provement was equally modest (47% above 
the parental strain), indicating that funda- 
mental constraints might be operating to 
limit the production of certain compound 
classes. 
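
The reasoning behind the phosphofructokinase example can be sketched with a deliberately simplified flux split. This is not a genome-scale model and not flux balance analysis; it only assumes that carbon uptake is fixed, that glycolytic flux is capped by the summed capacity of the phosphofructokinase isoforms, and that any excess is diverted to the pentose phosphate pathway. Isoform names and all numbers are hypothetical.

```python
# Toy flux split illustrating why removing one phosphofructokinase isoform
# could redirect carbon into the pentose phosphate pathway (PPP).
# Capacities and uptake are arbitrary illustrative units, not measured values.

def split_flux(uptake, pfk_capacities):
    """Return (glycolytic flux, PPP flux) for a fixed carbon uptake."""
    glycolysis = min(uptake, sum(pfk_capacities.values()))
    ppp = uptake - glycolysis
    return glycolysis, ppp


wild_type = {"isoform_1": 4.0, "isoform_2": 5.0, "isoform_3": 3.0}

# Hypothetical deletion mutant lacking the second isoform.
mutant = {name: cap for name, cap in wild_type.items() if name != "isoform_2"}

print("wild type:", split_flux(10.0, wild_type))
print("mutant:   ", split_flux(10.0, mutant))
```

A genome-scale model replaces this single capacity constraint with the full stoichiometric network, but the qualitative logic of constraining one route to increase flux through another is the same.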

A different type of optimized precursor 
supply is required for strains suitable for 
mutasynthesis. Kendrew et al. [75] have 
recently presented an optimized Strepto- 
myces rapamycinicus strain for the indus- 
trial production of rapalogs, which is no 
longer able to synthesize the starter acid, 
forcing the cells to incorporate the exoge- 
nous analogs supplied. The same strain 
is also optimized for robust industrial 
rapamycin production, and furthermore 
overexpresses the enzymes responsible for 
the late steps in the biosynthetic path- 
way. These enzymes, due to substrate 
specificity constraints, are inefficient in 
catalyzing the production of the rapalogs. 

Clearly, a major advantage of synthetic 
biology is the ability to move pathways 
out of their natural context, with few 
limitations. Within actinomycetes, an al- 
ternative chassis might be provided by 
thermophilic Streptomyces species, which 
show a much faster growth than current 
industrial strains [76]. Non-actinomycete 
hosts for complex secondary metabolites 
have recently been widely investigated; 
the first organism of choice is probably 
E. coli, and this species has indeed been 
tested repeatedly. Recently, examples have 
included biosynthesis of the polyketide 
erythromycin A [77], of the type II lan- 
tibiotic lichenicidin [78], of derivatives of 
the aminoglycoside ribostamycin [76], of
the TDP-L-megosamine megalomicin [79],
and of the polyketide 6-deoxyerythronolide 
B [80]. To illustrate the scale of these ef- 
forts, the overproduction of erythromycin 
A not only required heterologous expres- 
sion of a 20-enzyme biosynthetic pathway 
but also necessitated an additional manip- 
ulation to enable the biosynthesis of two 
essential precursors not normally found 
in E. coli. In total, the resulting strain 
required the engineering of 26 different 
genes, both native and heterologous. 

A wide variety of other novel chassis for 
complex antibiotics has also been used in 
proof-of-concept studies. A classic example 
of this is the production of the relatively
simple polyketide 6-methylsalicylic acid 
in Saccharomyces cerevisiae (and E. coli) 
by overexpression of the corresponding 
PKS and a heterologous phosphopanteth- 
einyl transferase to convert the synthase 
to its holo form [81]. More recent cases 
have included the production of miltira- 
diene, a key intermediate of tanshinone 
biosynthesis, in metabolically engineered 
yeast, S. cerevisiae, establishing an alterna- 
tive source for this group of plant-derived 
pharmaceuticals [82] and production of 
the polyketide oxytetracycline in Myxococ- 
cus xanthus [83]. Further discussions of 
various other groups of microbes as poten- 
tial chassis for industrial synthetic biology 
are available, including those of Wijffels 
et al. ([84]; cyanobacteria and microalgae), 
Vogl et al. ([85]; Pichia pastoris), Liu et al. 
([86]; Bacillus spp.), and Poblete-Castro 
et al. ([87]; Pseudomonas putida and related 
species). Not all of these are immedi- 
ately amenable as hosts for industry-scale 
fermentations of complex compounds by 
large enzymatic machinery, and Zhu et al. 
[88] discuss the synthetic biology strate- 
gies that can be employed to increase the 
robustness of engineered microbes for in- 
dustrial processes. 


One particular group of hosts which de- 
serves at least a passing mention is that of 
plants. While they are far less tractable for 
genetic engineering at the scale required 
for synthetic biology, plants have the po- 
tentially important advantage of providing 
their own carbon source (via photosynthe- 
sis) and, in many cases, being evolutionar- 
ily optimized for high levels of secondary 
metabolite production in specialized or- 
gans and organelles. While many of the 
early synthetic biology proof-of-principle 
studies were inspired by moving plant 
pathways into microbial hosts (e.g., the 
famous case of artemisinin [38]), it is not 
impossible that the opposite movement 
will also become relevant. A glimpse of the 
potential applications and challenges of 
secondary metabolite production in plants 
is provided by Zurbriggen et al. [89] and 
Pollier et al. [90]. 


7 
Tools for the Future 


Synthetic biology is a research area driven 
by technological and conceptual advances. 
Not all of these have already been applied 
to the engineering of biosynthetic path- 
ways, but some will obviously have an 
impact on antibiotics production in the
near future. 

Synthetic microbial consortia are a 
good example of this [91]. For example, 
Minty et al. [92] have recently described a 
consortium between a cellulolytic fungus, 
Trichoderma reesei, and a bacterium
engineered for isobutanol production, E. 
coli. Manipulation of the seeding ratio 
of the two species allowed fine-tuning 
of the biosynthetic performance, and 
quantitative ecological modeling was 
used to ensure robustness of the 
two-species consortium. Without nutrient 
supplementation, the system achieved
yields of up to 62% of the theoretical maximum
directly from cellulosic biomass,
such as corn stover. For secondary 
metabolite production, such a consortium 
approach would not only allow economic 
exploitation of a cheap carbon source, but 
would also have important advantages for 
the modular development of production 
systems, where precursor supply and 
end compound biosynthesis could be 
independently designed and optimized [4]. 

Another area of rapid developments is 
the field of engineered organelles and 
designed compartmentalization [93, 94]. 
Here, antibiotics production can bene- 
fit in multiple ways: compartmentalizing 
the biosynthetic pathways not only has 
the potential of improving catalytic effi- 
ciency by substrate channeling and in- 
creased local concentrations, but also can 
avoid possible negative effects of toxic 
intermediates or end products. In bac- 
teria, the most promising approach is 
the use of proteinaceous microcompart- 
ments for pathway containment based, 
for example, on the ethanolamine utiliza- 
tion (Eut) or the 1,2-propanediol utiliza- 
tion (Pdu) microcompartments, both of 
which have evolved to protect bacterial 
cells from toxic metabolic intermediates 
[95, 96]. In addition to the closed com- 
partments provided by these organelles, 
the optimization of biosynthetic path- 
ways — especially the assembly lines for 
complex secondary metabolites — can ben- 
efit from open scaffolding of the pathway 
enzymes, increasing their spatial prox- 
imity, optimizing their local stoichiom- 
etry, and minimizing the diffusion of 
toxic intermediates. The scaffolds can be 
based on proteins or nucleotides, using 
high-affinity/high-specificity binding domains
for assembly. For several metabolic
pathways, such as the mevalonate pathway 
and the glucaric acid pathway, scaffolding 
has been shown to lead to dramatic improvements
in yield [97, 98].

In addition to the enzymatic building 
blocks, the synthetic biology of complex 
biosynthetic pathways also depends on 
the availability of libraries of regulatory 
elements. Temme et al. [99] have in- 
troduced a system based on T7 RNA 
polymerases that allows the assembly of 
complex orthogonal regulatory circuits in 
bacteria. In combination with quantita- 
tively characterized libraries of promoters 
and terminators, this allows the engineer- 
ing of highly tunable expression control 
systems for multicistronic gene clusters. 
A synthetic promoter library specifically 
for actinomycetes (the main producers of 
antibiotics) has been developed and characterized
by Siegl et al. [100], who reported
reproducible promoter strengths between 
2% and 319% relative to the strong natural 
reference promoter ermEp1 from Saccha- 
ropolyspora erythraea. 
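
A quantitatively characterized promoter library can be used as a simple lookup table when tuning expression. The sketch below picks the library member whose relative strength lies closest to a desired level. The promoter names and intermediate strength values are hypothetical; only the overall range (2% to 319% of the reference) follows the text above.

```python
# Choose a promoter from a characterized library by relative strength.
# Strengths are expressed relative to a reference promoter (reference = 1.0).
# Promoter names and individual values are hypothetical placeholders.

promoter_library = {
    "P_a": 0.02,
    "P_b": 0.35,
    "P_c": 1.00,
    "P_d": 1.80,
    "P_e": 3.19,
}


def pick_promoter(library, target):
    """Return the promoter whose relative strength is closest to the target."""
    return min(library, key=lambda name: abs(library[name] - target))


print(pick_promoter(promoter_library, 1.5))
```

In practice such a choice would be only a starting point, since the strength measured in one genetic context need not transfer unchanged to another.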

Similarly, libraries of ribosome bind- 
ing sites have been constructed to allow 
the additional optimization of synthetic 
circuits at the level of translation [101]. 
New systems for regulating gene expres- 
sion, for example, by small regulatory 
RNAs [102], complement this collection 
of regulatory tools. For the major an- 
tibiotics producers, the regulatory effects 
of noncoding RNA were first described 
by D’Alia et al. [103] for a cis-encoded 
antisense RNA in the glutamine syn- 
thetase I gene in S. coelicolor, followed by 
characterization of the resulting synthetic 
metabolic switch by detailed metabolomic 
profiling using high-resolution mass spec- 
trometry [104]. This was complemented 
by the demonstration of a functional syn- 
thetic RNA silencer in the same species 
by Uguru et al. [105], targeting the 
biosynthesis of the pigmented antibiotic 
actinorhodin. 


One challenge that still requires special 
attention is the (lack of) quantitative 
predictability of control circuits com- 
posed from the elements in regulatory 
parts libraries. Kosuri et al. [106] have 
shown that such systems, assembled from 
well-characterized promoters and ribo- 
some binding sites, in many cases deviate 
far from the predicted behavior at the RNA 
and protein level, due to “idiosyncratic 
interactions and context effects.” These 
authors suggested that large-scale screen- 
ing efforts based on libraries, rather than 
quantitative characterization and standardization
of the building blocks, would
be the most promising strategy to deal with 
this lack of predictability. The introduction 
of “insulator” elements that reduce inter- 
ference and context dependence between 
the elements of synthetic regulatory cir- 
cuits [107] is another attempt at increasing 
the predictability of regulatory systems. 

Genetically encoded biosensors that can 
detect pathway intermediates and end 
products are another important compo- 
nent of the emerging toolkit of pathway 
engineering [108]. Once these can be re- 
producibly constructed, they will allow the 
instantaneous monitoring of the metabolic 
state of a production system and the ad- 
justment of enzyme levels to match cur- 
rent requirements. One promising version 
of such biosensors would be to use ri- 
boswitches based on RNA aptamers [109], 
which can bind to selected metabolites and 
in response modulate the expression level 
of specific target genes. 

Computational modeling, at the level of 
individual building blocks such as RNA 
devices [110, 111] or enzymes [112, 113], 
or on the whole-pathway and whole-cell 
scale [114-116], also plays a major role
in driving the rational engineering of 
biosynthetic systems for antibiotics production
[39]. These models allow the design of
new building blocks with desirable (or necessary)
functionality from scratch, their
assembly into novel pathways using ret- 
rosynthesis concepts [117], as well as the 
in silico identification of biosynthetic bot- 
tlenecks or unwanted side reactions, in 
particular when combined with diagnos- 
tic debugging tools such as metabolomics 
[118, 119].

Of course, the rapid generation of large 
libraries of huge biosynthetic gene clus- 
ters will also depend critically on advances 
in DNA synthesis and assembly method- 
ologies. Particularly relevant examples in 
the case of antibiotics biosynthesis are the 
recently developed methods for the tar- 
geted amplification of genetic loci, which 
can be used to increase the yield of het- 
erologous pathways [120] and have also 
been shown to increase the yield of na- 
tive antibiotics [121]. These systems, which 
are plasmid-free and do not rely on an- 
tibiotic selection, are also important for 
the transition of synthetic biology from 
the research laboratory to the produc- 
tion environment. This is also the case 
for stable expression plasmid based on 
toxin—antitoxin systems, which have been 
available for some time but have only re- 
cently been developed for the important 
antibiotics producers of the genus Strepto- 
myces [122]. While antibiotics production 
by synthetic biology will, for the fore- 
seeable future, be restricted to contained 
reactor systems, its widespread application 
will nonetheless benefit from considering 
the introduction of engineered intrinsic 
biocontainment mechanisms, as reviewed 
by Moe-Behrens et al. [123]. 